{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d6Z2dcsaAED9"
   },
   "source": [
    "# pandas-style queries in FiftyOne"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wMYFG6Pd7-B_"
   },
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_l4QlRm58UlV"
   },
   "source": [
    "[pandas](https://pypi.org/project/pandas/) is a Python library for data analysis. The central object in pandas is a `DataFrame`, which is a two-dimensional labeled data structure that handles tabular data. pandas is optimized for storing, manipulating, and analyzing tabular data, making it useful for a wide variety of data science, data engineering, and machine learning tasks.\n",
    "\n",
    "[FiftyOne](https://voxel51.com/docs/fiftyone/) is an open-source Python library for building high-quality datasets and computer vision models. The central object in FiftyOne is the `Dataset`, which allows for efficient handling of datasets consisting of images, videos, geospatial, or 3D data, as well as the corresponding metadata and labels associated with the media (which are often more complex than what can be represented in a two-dimensional data structure).\n",
    "\n",
    "While they apply to different types of data, the pandas `DataFrame` and FiftyOne `Dataset` classes share many similar functionalities. In this overview, we'll present a side-by-side comparison of common operations in the two libraries.\n",
    "\n",
    "If you're already a pandas power user, then you'll be a FiftyOne power user too after running through this tutorial!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sw6d1szg-gJq"
   },
   "source": [
    "## Getting started"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zD7e8xHcA5q7"
   },
   "source": [
    "The first thing to do is to install FiftyOne:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "9C4zazKdASnp",
    "outputId": "b1f28c7b-9fb9-46fb-af57-d4cce70d183f"
   },
   "outputs": [],
   "source": [
    "!pip install fiftyone"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_lEJDMgiBDPI"
   },
   "source": [
    "Then we will import pandas and FiftyOne:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "0yaGKWbMAWSB",
    "outputId": "ebfab18b-ff20-4ba4-96b8-6ffbc2d6b1ed"
   },
   "outputs": [],
   "source": [
    "import fiftyone as fo\n",
    "import fiftyone.zoo as foz\n",
    "from fiftyone import ViewField as F  # For handling expressions in matching and filtering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "id": "ArD7Sg30AjxC"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hzo-63jVArZm"
   },
   "source": [
    "In this tutorial, we will download example data for illustrative purposes. Before doing so, we demonstrate how to create empty `pd.DataFrame` and `fo.Dataset` objects."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hZTIMC21CHOo"
   },
   "source": [
    "### Create empty"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "36jK5JCbDB5m"
   },
   "source": [
    "#### Create empty `pd.DataFrame`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "id": "TOV7sB9WCEXU"
   },
   "outputs": [],
   "source": [
    "empty_df = pd.DataFrame()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WnJkLELeCQAW"
   },
   "source": [
    "We can get basic information about the `DataFrame` by calling its [info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "wHqWTmjUCMqZ",
    "outputId": "aa5bbac1-afd9-4da8-b0a8-557ef72cd931"
   },
   "outputs": [],
   "source": [
    "empty_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uQrar50OCeH-"
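   },
   "source": [
    "We can also give the `DataFrame` object a name. pandas has no dedicated name property for a `DataFrame`, so this simply sets an ordinary Python attribute:"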
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "id": "AVtemu7eCb82"
   },
   "outputs": [],
   "source": [
    "empty_df.name = 'empty_df'"
   ]
  },
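  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because an attribute set this way is not something pandas tracks, it may not survive operations that build a new object. The sketch below, on a made-up `demo_df` (a hypothetical name), suggests that the attribute is lost by `copy()`, whereas the experimental `DataFrame.attrs` dictionary is carried over:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical frame used only for illustration\n",
    "demo_df = pd.DataFrame({'x': [1, 2, 3]})\n",
    "\n",
    "# A plain Python attribute: pandas does not propagate it\n",
    "demo_df.name = 'demo'\n",
    "name_survives_copy = hasattr(demo_df.copy(), 'name')\n",
    "\n",
    "# DataFrame.attrs is an (experimental) dict intended for metadata\n",
    "demo_df.attrs['name'] = 'demo'\n",
    "attrs_after_copy = demo_df.copy().attrs\n",
    "\n",
    "print(name_survives_copy)\n",
    "print(attrs_after_copy)\n",
    "```\n",
    "\n",
    "A FiftyOne `Dataset`, by contrast, treats its name as a first-class property."
   ]
  },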
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "l17Wj27YDIiz"
   },
   "source": [
    "#### Create empty `fo.Dataset`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ln3DwnPbD-yI"
   },
   "source": [
    "We can similarly create a `Dataset` object by calling the FiftyOne core [fo.Dataset()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset) constructor without any arguments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "id": "Brd7yR3XCuZo"
   },
   "outputs": [],
   "source": [
    "empty_dataset = fo.Dataset()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "djhxWD0fEYFp"
   },
   "source": [
    "We can get basic info about the `Dataset` object using `print`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "oYK9KhWEEjLS",
    "outputId": "0f01dbfa-a127-4fbd-8042-d12b0410560c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Name:        2022.11.18.18.14.41\n",
      "Media type:  None\n",
      "Num samples: 0\n",
      "Persistent:  False\n",
      "Tags:        []\n",
      "Sample fields:\n",
      "    id:       fiftyone.core.fields.ObjectIdField\n",
      "    filepath: fiftyone.core.fields.StringField\n",
      "    tags:     fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n"
     ]
    }
   ],
   "source": [
    "print(empty_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NALNwUbpENhF"
   },
   "source": [
    "We can see a few things:\n",
    "1. Calling `fo.Dataset()` without a name resulted in a name being autogenerated from the time of creation.\n",
    "2. Whereas the empty pandas `DataFrame` has a (trivial) `Index`, the initialized FiftyOne `Dataset` has empty `Tags` (accessible via `dataset.tags`), and each entry, called a `Sample`, has predefined fields, including `id` and `filepath`. These are necessary for properly accessing and addressing the samples, as the `Dataset` stores pointers to the media files, not the media objects themselves."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Bl-UF4LgwJlE"
   },
   "source": [
    "If we wanted to name an existing `Dataset`, we could do so in analogous fashion to pandas:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "id": "lOchiFTAwSIW"
   },
   "outputs": [],
   "source": [
    "empty_dataset.name = \"empty-dataset\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "g66sAuHFLmP2",
    "outputId": "73543550-bbb0-49f8-9136-7ac6fa6941d9"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Name:        empty-dataset\n",
      "Media type:  None\n",
      "Num samples: 0\n",
      "Persistent:  False\n",
      "Tags:        []\n",
      "Sample fields:\n",
      "    id:       fiftyone.core.fields.ObjectIdField\n",
      "    filepath: fiftyone.core.fields.StringField\n",
      "    tags:     fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n"
     ]
    }
   ],
   "source": [
    "print(empty_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-1Y1zHCVMm57"
   },
   "source": [
    "Alternatively, if we want to initialize the dataset with a name, we can pass a name in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "id": "kBwBalwBCxL7"
   },
   "outputs": [],
   "source": [
    "empty_dataset = fo.Dataset('empty-ds')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ehR2QMmbNRhO"
   },
   "source": [
    "### Example data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QFIMXd1DMzco"
   },
   "source": [
    "For the rest of this tutorial, we will use the following example data:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "V0AUY9QzNw0S"
   },
   "source": [
    "#### [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "id": "v9fsq6IFNrok"
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "LoRRnmVaODnY",
    "outputId": "da9e07d8-9fe2-4090-cb2d-48a963d033cd"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 150 entries, 0 to 149\n",
      "Data columns (total 5 columns):\n",
      " #   Column        Non-Null Count  Dtype  \n",
      "---  ------        --------------  -----  \n",
      " 0   sepal_length  150 non-null    float64\n",
      " 1   sepal_width   150 non-null    float64\n",
      " 2   petal_length  150 non-null    float64\n",
      " 3   petal_width   150 non-null    float64\n",
      " 4   species       150 non-null    object \n",
      "dtypes: float64(4), object(1)\n",
      "memory usage: 6.0+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Rnh6o1wyNvf2",
    "outputId": "038b29dc-bbf4-436e-a708-b0c830eede52"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',\n",
       "       'species'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5RmCwqPxOTLB"
   },
   "source": [
    "#### [FiftyOne Quickstart Data](https://github.com/voxel51/fiftyone-examples/blob/master/examples/quickstart.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "VDgVDMPIOJwc",
    "outputId": "0ad7a1af-f7c8-49c8-81a0-91c58bb36c91"
   },
   "outputs": [],
   "source": [
    "ds = foz.load_zoo_dataset(\"quickstart\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "WxVICdpPOfkD",
    "outputId": "52d2986f-0fe6-4bea-eb7c-be36c7cd1d5d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Name:        quickstart\n",
      "Media type:  image\n",
      "Num samples: 200\n",
      "Persistent:  True\n",
      "Tags:        []\n",
      "Sample fields:\n",
      "    id:              fiftyone.core.fields.ObjectIdField\n",
      "    filepath:        fiftyone.core.fields.StringField\n",
      "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
      "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    uniqueness:      fiftyone.core.fields.FloatField\n",
      "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    eval_tp:         fiftyone.core.fields.IntField\n",
      "    eval_fp:         fiftyone.core.fields.IntField\n",
      "    eval_fn:         fiftyone.core.fields.IntField\n",
      "    abstractness:    fiftyone.core.fields.FloatField\n",
      "    new_const_field: fiftyone.core.fields.IntField\n",
      "    computed_field:  fiftyone.core.fields.IntField\n"
     ]
    }
   ],
   "source": [
    "print(ds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MG5C23lKOjtG"
   },
   "source": [
    "## Basics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YgShu494SCKI"
   },
   "source": [
    "### Head and tail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xOLMysGISFm3"
   },
   "source": [
    "To start to get a feel for the data, we might want to inspect a few entries. For instance, we might want to look at the first few entries, or the last few entries. In both pandas and FiftyOne, these can be accomplished with the [head()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.head) and [tail()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.tail) methods, which have identical syntax."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nWT1Xp-ISlgY"
   },
   "source": [
    "#### Head"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "-J1YruIiSn8s",
    "outputId": "9b2172d1-fadd-4299-e0cc-ad1c3693b0ca"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species\n",
       "0           5.1          3.5           1.4          0.2  setosa\n",
       "1           4.9          3.0           1.4          0.2  setosa\n",
       "2           4.7          3.2           1.3          0.2  setosa\n",
       "3           4.6          3.1           1.5          0.2  setosa\n",
       "4           5.0          3.6           1.4          0.2  setosa"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "DJJLvkeWSx2S",
    "outputId": "7d053745-940a-410c-dfc7-b3ff4725cab6"
   },
   "outputs": [],
   "source": [
    "first_few_samples = ds.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4_R1gl2BS8yl"
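   },
   "source": [
    "For instance, running `DataFrame.head(n)` returns the first $n$ *rows* of the original `DataFrame`, while running `Dataset.head(n)` returns a list of the first $n$ *samples* of the original `Dataset`. Note that the defaults differ: pandas returns five rows when `n` is omitted, whereas FiftyOne returns three samples."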
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AlVdORLLPCy5"
   },
   "source": [
    "In a pandas `DataFrame`, two-dimensional tabular data is represented in *rows* and *columns*. \n",
    "\n",
    "Analogously, a FiftyOne `Dataset` consists of *samples* and *fields*. More explicitly:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UWJJfF0IOr1n"
   },
   "source": [
    "| pandas DataFrame | FiftyOne Dataset |\n",
    "| :--------------: | ---------------: |\n",
    "| Row              | Sample           |\n",
    "| Column           | Field            |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GH7vwlCjO2_x"
   },
   "source": [
    "In pandas, we expect that a fixed set of columns, each representing a different feature, suffices to represent the data. Some rows might not have values for each column, but each row has the same schema. This is ideal for dealing with a wide variety of data, from housing prices to time series predictions.\n",
    "\n",
    "FiftyOne is built for dealing with the unstructured data often encountered in computer vision applications. As such, a FiftyOne `Dataset` does not assume such a uniform schema. In our example dataset `ds`, consider the field `predictions`. This field holds a list of `Detection` objects, each of which has its own label, bounding box, and confidence score. These represent a model's predictions for detected objects in the image corresponding to the sample. Not all images are guaranteed to contain the same number of predicted objects, so it is preferable for samples to be more flexible than the rows in a `DataFrame`!"
   ]
  },
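  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the contrast concrete, here is a minimal pandas-only sketch on made-up records. The file names, labels, and scores are hypothetical, and plain dicts stand in for FiftyOne's `Detection` objects. Each image has a different number of detections, so a flat table needs one row per detection rather than one row per image:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical per-image records; plain dicts stand in for Detection objects\n",
    "records = [\n",
    "    {'filepath': 'img_0.jpg', 'detections': [\n",
    "        {'label': 'cat', 'confidence': 0.9},\n",
    "        {'label': 'dog', 'confidence': 0.6},\n",
    "    ]},\n",
    "    {'filepath': 'img_1.jpg', 'detections': [\n",
    "        {'label': 'bird', 'confidence': 0.8},\n",
    "    ]},\n",
    "    {'filepath': 'img_2.jpg', 'detections': []},\n",
    "]\n",
    "\n",
    "# The number of detections varies from image to image\n",
    "num_dets = [len(r['detections']) for r in records]\n",
    "\n",
    "# Flatten to one row per detection so the data fits a fixed set of columns\n",
    "flat_df = pd.json_normalize(records, record_path='detections', meta=['filepath'])\n",
    "\n",
    "print(num_dets)\n",
    "print(flat_df)\n",
    "```\n",
    "\n",
    "Note that `img_2.jpg`, which has no detections, does not appear in the flattened table at all; keeping the nested structure per sample avoids that loss."
   ]
  },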
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "h6dVoE7jVnGx"
   },
   "source": [
    "#### Tail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nc3AAYvvVxNP"
   },
   "source": [
    "To get the last $n$ entries (rows or samples), we can use the `tail(n)` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "DNtAod6MVsJK",
    "outputId": "dee43154-3078-4c86-c91f-2bcd3214dbe2"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>145</th>\n",
       "      <td>6.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.3</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>6.3</td>\n",
       "      <td>2.5</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1.9</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>147</th>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.0</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>148</th>\n",
       "      <td>6.2</td>\n",
       "      <td>3.4</td>\n",
       "      <td>5.4</td>\n",
       "      <td>2.3</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149</th>\n",
       "      <td>5.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.1</td>\n",
       "      <td>1.8</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sepal_length  sepal_width  petal_length  petal_width    species\n",
       "145           6.7          3.0           5.2          2.3  virginica\n",
       "146           6.3          2.5           5.0          1.9  virginica\n",
       "147           6.5          3.0           5.2          2.0  virginica\n",
       "148           6.2          3.4           5.4          2.3  virginica\n",
       "149           5.9          3.0           5.1          1.8  virginica"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.tail(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "KzloDibTVtmd",
    "outputId": "0075fc5e-f4d2-475b-b56f-403fcdd664f1"
   },
   "outputs": [],
   "source": [
    "last_few_samples = ds.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_HqExn6kZfTs"
   },
   "source": [
    "### First and last"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hoJwhMC_ZkOJ"
   },
   "source": [
    "If we only want the first sample in a `Dataset`, we can use the [first()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.first) method, which is equivalent to `ds.head()[0]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "id": "eWYJAxWgZ52q"
   },
   "outputs": [],
   "source": [
    "first_sample = ds.first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2NO4lqRuZ9Y-"
   },
   "source": [
    "Similarly, if we only want the last sample, we can use the [last()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.last) method, which is equivalent to `ds.tail()[-1]`, since `tail()` preserves the original sample order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "id": "N-rlhzX-aTzC"
   },
   "outputs": [],
   "source": [
    "last_sample = ds.last()"
   ]
  },
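  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In pandas, a rough analog of these two accessors is positional indexing with `iloc`. The following is a minimal sketch on a made-up frame (`toy_df` and its values are hypothetical); it also checks that `head()` and `tail()` preserve row order:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical frame used only for illustration\n",
    "toy_df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})\n",
    "\n",
    "first_row = toy_df.iloc[0]    # rough analog of ds.first()\n",
    "last_row = toy_df.iloc[-1]    # rough analog of ds.last()\n",
    "\n",
    "# head()/tail() keep the original order of the rows\n",
    "same_first = first_row.equals(toy_df.head(3).iloc[0])\n",
    "same_last = last_row.equals(toy_df.tail(3).iloc[-1])\n",
    "\n",
    "print(first_row['value'], last_row['value'])\n",
    "print(same_first, same_last)\n",
    "```"
   ]
  },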
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JxRrY0V2fnfq"
   },
   "source": [
    "### Get single element"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9y7r_jADfrra"
   },
   "source": [
    "In pandas, if we want to get the row at index $j$ in a `DataFrame`, we can use `loc[j]` (label-based lookup) or `iloc[j]` (position-based lookup), depending on our usage. For instance,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "id": "ru6ACxIFztn3"
   },
   "outputs": [],
   "source": [
    "j = 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "DbUrRFqXfrRE",
    "outputId": "585b1243-c6af-47b4-ec98-88107c2585de"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sepal_length       5.4\n",
       "sepal_width        3.7\n",
       "petal_length       1.5\n",
       "petal_width        0.2\n",
       "species         setosa\n",
       "Name: 10, dtype: object"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[j]"
   ]
  },
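  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two accessors agree here only because `df` uses the default integer index. As a minimal sketch on a made-up frame with string labels (the `flowers` frame and its values are hypothetical), `loc` looks rows up by index label while `iloc` looks them up by zero-based position:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical frame with a non-default (string) index\n",
    "flowers = pd.DataFrame(\n",
    "    {'petal_length': [1.4, 4.7, 6.0]},\n",
    "    index=['a', 'b', 'c'],\n",
    ")\n",
    "\n",
    "by_label = flowers.loc['b']      # label-based lookup\n",
    "by_position = flowers.iloc[1]    # position-based lookup\n",
    "\n",
    "print(by_label['petal_length'])\n",
    "print(by_label.equals(by_position))\n",
    "```\n",
    "\n",
    "With the default `RangeIndex`, labels and positions coincide, which is why the distinction is easy to overlook."
   ]
  },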
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pt1KwRu5frMK"
   },
   "source": [
    "In FiftyOne, we can achieve the same functionality of picking out the sample at (zero-based) position $j$ by running:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "id": "CWaB4JjFzWwH"
   },
   "outputs": [],
   "source": [
    "sample = ds.skip(j).first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0pXHmsKe0ERv"
   },
   "source": [
    "However, in many cases, one is more interested in extracting samples based on their sample ID or filepath. In these cases, the syntactic sugar mirrors pandas: both `sample = ds[id]` and `sample = ds[filepath]` achieve the desired result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "NUqCFPuyfqWK",
    "outputId": "ff1ae967-7d19-4670-c830-c7aa532e62fe"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n"
     ]
    }
   ],
   "source": [
    "filepath = sample.filepath\n",
    "print(ds[filepath].id == sample.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Ol_5-N3RaYLU"
   },
   "source": [
    "### Number of rows/samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "l-7fCe61ctEv"
   },
   "source": [
    "We can get the number of samples in a `fo.Dataset` just as we would get the number of rows in a `pd.DataFrame`: by passing it to Python's `len()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Qk3OPFcqaYuk",
    "outputId": "955375d5-ecdd-40af-9a19-e3bf7bca6fbb"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "150"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "XTRp2ILXaY1f",
    "outputId": "eacbf509-00f6-4688-b784-60ca33b627f1"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "200"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(ds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cfDS8X4ZhxYc"
   },
   "source": [
    "There are $150$ flowers in the Iris dataset, and $200$ images in our FiftyOne Quickstart dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "P6OlZAMqaY7T"
   },
   "source": [
    "### Getting columns/field schema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Y0q0EWTXdr2P"
   },
   "source": [
    "In pandas, where all rows in a `DataFrame` share the same columns, we can get the names of the columns with the `DataFrame.columns` property."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bhVw0JkyaZA5",
    "outputId": "c3fd338a-8643-40f0-a57b-41c53774fd9d"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',\n",
       "       'species'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
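  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`columns` gives only the names. The closest pandas analog of a schema that maps names to types is `DataFrame.dtypes`. A minimal sketch on a made-up two-column frame (`toy` is a hypothetical stand-in for `df`):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical frame standing in for the Iris DataFrame\n",
    "toy = pd.DataFrame({\n",
    "    'sepal_length': [5.1, 4.9],\n",
    "    'species': ['setosa', 'setosa'],\n",
    "})\n",
    "\n",
    "column_names = list(toy.columns)\n",
    "\n",
    "# Map each column name to its dtype, loosely mirroring a field schema\n",
    "schema = {name: str(dtype) for name, dtype in toy.dtypes.items()}\n",
    "\n",
    "print(column_names)\n",
    "print(schema)\n",
    "```"
   ]
  },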
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "D_qI7WpId7zv"
   },
   "source": [
    "In FiftyOne, the core field schema is shared among samples, but the structure within these first-level fields can vary. We can get the field schema by calling the [get_field_schema()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.get_field_schema) method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ks-PPZilaZGF",
    "outputId": "a3843273-f8de-4815-fa35-68198c2d803d"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OrderedDict([('id', <fiftyone.core.fields.ObjectIdField at 0x2a0a65a90>),\n",
       "             ('filepath', <fiftyone.core.fields.StringField at 0x2a0a5b2b0>),\n",
       "             ('tags', <fiftyone.core.fields.ListField at 0x2a0a8c460>),\n",
       "             ('metadata',\n",
       "              <fiftyone.core.fields.EmbeddedDocumentField at 0x2a0a8c100>),\n",
       "             ('ground_truth',\n",
       "              <fiftyone.core.fields.EmbeddedDocumentField at 0x2a0a651f0>),\n",
       "             ('uniqueness', <fiftyone.core.fields.FloatField at 0x2a0a8cd90>),\n",
       "             ('predictions',\n",
       "              <fiftyone.core.fields.EmbeddedDocumentField at 0x2a0a8c1f0>),\n",
       "             ('eval_tp', <fiftyone.core.fields.IntField at 0x2a0a8cf40>),\n",
       "             ('eval_fp', <fiftyone.core.fields.IntField at 0x2a0a8cf70>),\n",
       "             ('eval_fn', <fiftyone.core.fields.IntField at 0x2a0a78550>),\n",
       "             ('abstractness',\n",
       "              <fiftyone.core.fields.FloatField at 0x2a0a78580>),\n",
       "             ('new_const_field',\n",
       "              <fiftyone.core.fields.IntField at 0x2a0a785b0>),\n",
       "             ('computed_field',\n",
       "              <fiftyone.core.fields.IntField at 0x2a0a785e0>)])"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.get_field_schema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LqOI1PLjx4Ir"
   },
   "source": [
    "For video datasets, `get_field_schema()` returns the sample-level fields, while the frame-level fields are available via [get_frame_field_schema()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.get_frame_field_schema)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oGs6beLC-mHo"
   },
   "source": [
    "Some of the field types, such as [FloatField](https://voxel51.com/docs/fiftyone/api/fiftyone.core.fields.html#fiftyone.core.fields.FloatField) (float) and [StringField](https://voxel51.com/docs/fiftyone/api/fiftyone.core.fields.html#fiftyone.core.fields.StringField) (string) correspond in straightforward fashion to data types in pandas, or in Python more generally. As we will see below, the [EmbeddedDocumentField](https://voxel51.com/docs/fiftyone/api/fiftyone.core.fields.html#fiftyone.core.fields.EmbeddedDocumentField), which does not have a perfect analog in pandas, is part of what gives the FiftyOne `Dataset` its powerful flexibility for tackling computer vision tasks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RUUNgWl1hhho"
   },
   "source": [
    "If we just want the field names for all samples in the dataset, you can do the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Tn5Nrr17hh0z",
    "outputId": "bd496262-54b4-48d6-dbae-8d6662fe950e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field']\n"
     ]
    }
   ],
   "source": [
    "field_names = list(ds.get_field_schema().keys())\n",
    "print(field_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mtFulgpltxiq"
   },
   "source": [
    "### All values in a column/field"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "K_FhFlaZwNZx"
   },
   "source": [
    "In pandas, the entries in each column or `pd.Series` object must themselves be objects of the type of one of the numpy data types. Thus, when all of the values in a column are extracted, the resulting list will have depth one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "SeKy5Vdxt5kW",
    "outputId": "479a7946-d1c8-4fb9-f371-d8c33b9923ab"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]\n"
     ]
    }
   ],
   "source": [
    "col = \"sepal_length\"\n",
    "sepal_lengths = df[col].tolist()\n",
    "print(sepal_lengths[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pfFctNGv1Cok"
   },
   "source": [
    "FiftyOne supports this functionality as well. For instance, each image in our dataset has a [uniqueness](https://voxel51.com/docs/fiftyone/tutorials/uniqueness.html) score, which is a measure of how unique a given image is in the context of the complete dataset. We can extract these values for each image using the [values()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.values) method as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "wP0AZJJV2Zk8",
    "outputId": "fea3f801-f5cf-414b-9eae-aa182545294d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.8175834390151201, 0.6844698885072961, 0.725267119762334, 0.7164587220038886, 0.6874799405473135, 0.6773349111042449, 0.6948791555330056, 0.6157872732023304, 0.6692531238595459, 0.7257486965960712]\n"
     ]
    }
   ],
   "source": [
    "uniqueness = ds.values(\"uniqueness\")\n",
    "print(uniqueness[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lpNbARwl26Xk"
   },
   "source": [
    "Some of the relevant information for computer vision tasks, however, is less structured. In our example dataset, this is the case for both the `ground_truth` and `predictions` fields, each of which contains a number of object detections in the *embedded* `detections` field. The `values` method also gives us access to these embedded fields. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "URMsKWgI5uZe"
   },
   "source": [
    "Let's see this in action by using the `values` method to pull out the confidence score for each predicted detection:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "id": "hKRW20Sft5Hw"
   },
   "outputs": [],
   "source": [
    "pred_confs = ds.values(\"predictions.detections.confidence\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "nLQDxL4J6Hs9",
    "outputId": "78c3edb1-9473-439a-bb6f-6e027a36c829"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'>\n",
      "200\n",
      "<class 'list'>\n"
     ]
    }
   ],
   "source": [
    "print(type(pred_confs))\n",
    "print(len(pred_confs))\n",
    "print(type(pred_confs[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "G4WIPCmu6MxX"
   },
   "source": [
    "As with `values(\"uniqueness\")`, we get a list with one result per image. However, now we have a sublist for each image, rather than just a single value. We can peak inside one of these sublists at the confidence scores for each detection:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "kjyU3g4Y7p3D",
    "outputId": "9398d340-b21a-4483-d47f-2d5ed8e64ed7"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.9750854969024658, 0.759726881980896, 0.6569182276725769, 0.2359301745891571, 0.221974179148674, 0.1965726613998413, 0.18904592096805573, 0.11480894684791565, 0.11089690029621124, 0.0971052274107933, 0.08403241634368896, 0.07699568569660187, 0.058097004890441895, 0.0519101656973362]\n"
     ]
    }
   ],
   "source": [
    "print(pred_confs[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jaDM2knz69Gc"
   },
   "source": [
    "Let's get the lengths of these sublists and print the first few. In the section on `fo.Expression`, we will see a more natural (and efficient) way of performing this operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "orOXKrPh7YoI",
    "outputId": "dc3c1611-ef83-43e7-d26a-29f61157dd6d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[14, 20, 10, 51, 27, 13, 2, 9, 7, 13]\n"
     ]
    }
   ],
   "source": [
    "pred_conf_lens = [len(p) for p in pred_confs]\n",
    "print(pred_conf_lens[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vZpqLr0171yy"
   },
   "source": [
    "We can see that the number of confidence scores - and correspondingly the number of predictions - for each image is not fixed. This scenario is fairly typical in object detection tasks, where images can have varying numbers of objects!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "74lmew1HaZLt"
   },
   "source": [
    "## View stages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ujd0Qj8yOQiy"
   },
   "source": [
    "### Making a copy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7NVfg10BOrkE"
   },
   "source": [
    "Suppose we want to make a copy of the original data and modify the copy without the changes propagating back to the original."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1A1AnJxlO53t"
   },
   "source": [
    "In pandas, we can do this with the `copy` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "veM5YQiMOXMo",
    "outputId": "aea0a4fa-9f2b-436c-a86e-729c93b36fef"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species\n",
       "0           5.1          3.5           1.4          0.2  setosa\n",
       "1           4.9          3.0           1.4          0.2  setosa\n",
       "2           4.7          3.2           1.3          0.2  setosa\n",
       "3           4.6          3.1           1.5          0.2  setosa\n",
       "4           5.0          3.6           1.4          0.2  setosa"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "copy_df = df.copy()\n",
    "copy_df['species'] = 'none'\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dt12a0FjOXAo"
   },
   "source": [
    "In FiftyOne, we can do this with the [clone()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.clone) method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "nEBr6O9mOWYP",
    "outputId": "732d7846-2625-467e-d4f8-da856e5f572a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "quickstart\n"
     ]
    }
   ],
   "source": [
    "copy_ds = ds.clone()\n",
    "copy_ds.name = 'copy_ds'\n",
    "print(ds.name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kOJ2_NKeeXa9"
   },
   "source": [
    "### Slicing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ckouFS-gjPVc"
   },
   "source": [
    "In pandas if we want to get a slice of a `DataFrame`, we can do so with the notation `df[start:end]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "id": "mC7taJ26lLrI"
   },
   "outputs": [],
   "source": [
    "start = 10\n",
    "end = 14"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 175
    },
    "id": "D17nMUbpeXjn",
    "outputId": "fdbd6de9-d225-48dd-e018-09ae9a91b48b"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>5.4</td>\n",
       "      <td>3.7</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>4.8</td>\n",
       "      <td>3.4</td>\n",
       "      <td>1.6</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>4.8</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.1</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>4.3</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.1</td>\n",
       "      <td>0.1</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    sepal_length  sepal_width  petal_length  petal_width species\n",
       "10           5.4          3.7           1.5          0.2  setosa\n",
       "11           4.8          3.4           1.6          0.2  setosa\n",
       "12           4.8          3.0           1.4          0.1  setosa\n",
       "13           4.3          3.0           1.1          0.1  setosa"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[start:end]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Cngym8UjeXqc"
   },
   "source": [
    "In FiftyOne, a `Dataset` can be sliced using the same notation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "V02h4IWzeXv_",
    "outputId": "351dc1ea-f05f-4579-a551-6ebb0b63db56"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset:     quickstart\n",
       "Media type:  image\n",
       "Num samples: 4\n",
       "Sample fields:\n",
       "    id:              fiftyone.core.fields.ObjectIdField\n",
       "    filepath:        fiftyone.core.fields.StringField\n",
       "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
       "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
       "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    uniqueness:      fiftyone.core.fields.FloatField\n",
       "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    eval_tp:         fiftyone.core.fields.IntField\n",
       "    eval_fp:         fiftyone.core.fields.IntField\n",
       "    eval_fn:         fiftyone.core.fields.IntField\n",
       "    abstractness:    fiftyone.core.fields.FloatField\n",
       "    new_const_field: fiftyone.core.fields.IntField\n",
       "    computed_field:  fiftyone.core.fields.IntField\n",
       "View stages:\n",
       "    1. Skip(skip=10)\n",
       "    2. Limit(limit=4)"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds[start:end]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Uiic_xtGeX1q"
   },
   "source": [
    "However, as we can see from the output of the preceding command, this is merely syntactical sugar for the expression:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "R7-xnk5eeX7d",
    "outputId": "ca9e1eda-2fbc-4a4a-e9ce-1c71f402c8b5"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset:     quickstart\n",
       "Media type:  image\n",
       "Num samples: 4\n",
       "Sample fields:\n",
       "    id:              fiftyone.core.fields.ObjectIdField\n",
       "    filepath:        fiftyone.core.fields.StringField\n",
       "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
       "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
       "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    uniqueness:      fiftyone.core.fields.FloatField\n",
       "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    eval_tp:         fiftyone.core.fields.IntField\n",
       "    eval_fp:         fiftyone.core.fields.IntField\n",
       "    eval_fn:         fiftyone.core.fields.IntField\n",
       "    abstractness:    fiftyone.core.fields.FloatField\n",
       "    new_const_field: fiftyone.core.fields.IntField\n",
       "    computed_field:  fiftyone.core.fields.IntField\n",
       "View stages:\n",
       "    1. Skip(skip=10)\n",
       "    2. Limit(limit=4)"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.skip(start).limit(end - start)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2XY5bkbtuUkS"
   },
   "source": [
    "which utilizes the [skip()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.skip) and [limit()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.limit) methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "U3R4Z6yreYBM"
   },
   "source": [
    "### Get random samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_bJ30tBreYG8"
   },
   "source": [
    "When working with datasets, it is often the case that one might want to select a random set of samples. One typically wants either (a) a fixed number of random samples, or (b) to sample some fraction of the data randomly. We will show how to do both:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EzMukmRXeYMp"
   },
   "source": [
    "#### Select $k$ random samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "id": "mCa1WdVbeYSo"
   },
   "outputs": [],
   "source": [
    "k = 20"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zKinLXr3nEDe"
   },
   "source": [
    "In pandas, you can use the `sample()` method, passing in either a number, as in `sample(n = k)`, or a fraction, as we show below "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "id": "fEDfLTSfeYY6"
   },
   "outputs": [],
   "source": [
    "rand_samples_df = df.sample(n=k)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "gwj4tJmuoLk4",
    "outputId": "7f8f3650-d066-4b74-f3ac-a705a12d30bf"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>101</th>\n",
       "      <td>5.8</td>\n",
       "      <td>2.7</td>\n",
       "      <td>5.1</td>\n",
       "      <td>1.9</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129</th>\n",
       "      <td>7.2</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.8</td>\n",
       "      <td>1.6</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>79</th>\n",
       "      <td>5.7</td>\n",
       "      <td>2.6</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.0</td>\n",
       "      <td>versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>6.3</td>\n",
       "      <td>3.3</td>\n",
       "      <td>6.0</td>\n",
       "      <td>2.5</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sepal_length  sepal_width  petal_length  petal_width     species\n",
       "101           5.8          2.7           5.1          1.9   virginica\n",
       "129           7.2          3.0           5.8          1.6   virginica\n",
       "1             4.9          3.0           1.4          0.2      setosa\n",
       "79            5.7          2.6           3.5          1.0  versicolor\n",
       "100           6.3          3.3           6.0          2.5   virginica"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rand_samples_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xSYTCWd7eYfY"
   },
   "source": [
    "In FiftyOne, we can use the [take()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.take) method, to which we can pass in a random seed, or let it seed the random number generator with the time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "id": "4TYYzNh6eYmT"
   },
   "outputs": [],
   "source": [
    "rand_samples_ds = ds.take(k, seed=123)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Gp4u7pqYeYtF",
    "outputId": "22c0acaa-9e50-4a68-9832-fe994bada74b"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset:     quickstart\n",
       "Media type:  image\n",
       "Num samples: 20\n",
       "Sample fields:\n",
       "    id:              fiftyone.core.fields.ObjectIdField\n",
       "    filepath:        fiftyone.core.fields.StringField\n",
       "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
       "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
       "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    uniqueness:      fiftyone.core.fields.FloatField\n",
       "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    eval_tp:         fiftyone.core.fields.IntField\n",
       "    eval_fp:         fiftyone.core.fields.IntField\n",
       "    eval_fn:         fiftyone.core.fields.IntField\n",
       "    abstractness:    fiftyone.core.fields.FloatField\n",
       "    new_const_field: fiftyone.core.fields.IntField\n",
       "    computed_field:  fiftyone.core.fields.IntField\n",
       "View stages:\n",
       "    1. Take(size=20, seed=123)"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rand_samples_ds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "u80zfFhxUOoL"
   },
   "source": [
    "With the [random utils](https://voxel51.com/docs/fiftyone/api/fiftyone.utils.random.html) in FiftyOne, you can also sample flexibly with user-input weighting schemes, but that is beyond the present scope."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "e5MgaRUmaSEF"
   },
   "source": [
    "#### Randomly select fraction $p<1$ of samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "id": "6OF5u-ysoYBo"
   },
   "outputs": [],
   "source": [
    "p = 0.05"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "_z6bHMcjoX5j",
    "outputId": "68d2ac38-dd14-46b5-c29f-7e84616570ad"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>140</th>\n",
       "      <td>6.7</td>\n",
       "      <td>3.1</td>\n",
       "      <td>5.6</td>\n",
       "      <td>2.4</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>5.8</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.2</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.3</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58</th>\n",
       "      <td>6.6</td>\n",
       "      <td>2.9</td>\n",
       "      <td>4.6</td>\n",
       "      <td>1.3</td>\n",
       "      <td>versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>90</th>\n",
       "      <td>5.5</td>\n",
       "      <td>2.6</td>\n",
       "      <td>4.4</td>\n",
       "      <td>1.2</td>\n",
       "      <td>versicolor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sepal_length  sepal_width  petal_length  petal_width     species\n",
       "140           6.7          3.1           5.6          2.4   virginica\n",
       "14            5.8          4.0           1.2          0.2      setosa\n",
       "40            5.0          3.5           1.3          0.3      setosa\n",
       "58            6.6          2.9           4.6          1.3  versicolor\n",
       "90            5.5          2.6           4.4          1.2  versicolor"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sample(frac=p).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "t80HvDkcoXy5",
    "outputId": "a3e0197f-e9c7-47d9-a404-3001c9c45936"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset:     quickstart\n",
       "Media type:  image\n",
       "Num samples: 10\n",
       "Sample fields:\n",
       "    id:              fiftyone.core.fields.ObjectIdField\n",
       "    filepath:        fiftyone.core.fields.StringField\n",
       "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
       "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
       "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    uniqueness:      fiftyone.core.fields.FloatField\n",
       "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    eval_tp:         fiftyone.core.fields.IntField\n",
       "    eval_fp:         fiftyone.core.fields.IntField\n",
       "    eval_fn:         fiftyone.core.fields.IntField\n",
       "    abstractness:    fiftyone.core.fields.FloatField\n",
       "    new_const_field: fiftyone.core.fields.IntField\n",
       "    computed_field:  fiftyone.core.fields.IntField\n",
       "View stages:\n",
       "    1. Take(size=10, seed=123)"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We need to convert from fraction p to an integer k\n",
    "k = int(len(ds) * p)\n",
    "ds.take(k, seed=123)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9Hn2wZQsoXcc"
   },
   "source": [
    "### Shuffle data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IRl7NSmipOnc"
   },
   "source": [
    "In a similar vein to randomly selecting samples, one might want to create a new view in which the entire dataset is shuffled."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VdQ2ccv0p0sr"
   },
   "source": [
    "In pandas, we can accomplish this by randomly sampling all the rows ($\\mathrm{frac}=1$) without replacement:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "id": "SKra1WDppOfx"
   },
   "outputs": [],
   "source": [
    "shuffled_df_view = df.sample(frac=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HCQrq7vHpOYO"
   },
   "source": [
    "In FiftyOne, we can just call the [shuffle()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.shuffle) method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "id": "grHrEpMcpOR1"
   },
   "outputs": [],
   "source": [
    "shuffled_ds_view = ds.shuffle(seed=123)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OVPcsbAWpOK6"
   },
   "source": [
    "### Filtering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Xbw_o6NopODs"
   },
   "source": [
    "It is also quite natural to want to filter out the data based on some condition. For the Iris data, for instance, let's get all of the flowers that have a sepal length greater than seven:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "id": "68QKzLxqpN96"
   },
   "outputs": [],
   "source": [
    "sepal_length_thresh = 7\n",
    "large_sepal_len_view = df[df.sepal_length > sepal_length_thresh]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ijnrG1xLpN5E",
    "outputId": "dd86ee05-0db2-438d-c3f0-624ec48aff25"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12\n",
      "     sepal_length  sepal_width  petal_length  petal_width    species\n",
      "102           7.1          3.0           5.9          2.1  virginica\n",
      "105           7.6          3.0           6.6          2.1  virginica\n",
      "107           7.3          2.9           6.3          1.8  virginica\n",
      "109           7.2          3.6           6.1          2.5  virginica\n",
      "117           7.7          3.8           6.7          2.2  virginica\n"
     ]
    }
   ],
   "source": [
    "print(len(large_sepal_len_view))\n",
    "print(large_sepal_len_view.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WMz2NvQN-8ew"
   },
   "source": [
    "In FiftyOne, we can perform an analogous filtering operation on the quickstart images, using the [match()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.match) method and the [ViewField](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html#fiftyone.core.expressions.ViewField) to select all images that have a \"uniqueness\" score above some threshold:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "nAaka_GN-ovX",
    "outputId": "fb9a6fa2-49b9-4567-ee0a-207062d830a7"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset:     quickstart\n",
      "Media type:  image\n",
      "Num samples: 8\n",
      "Sample fields:\n",
      "    id:              fiftyone.core.fields.ObjectIdField\n",
      "    filepath:        fiftyone.core.fields.StringField\n",
      "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
      "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    uniqueness:      fiftyone.core.fields.FloatField\n",
      "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    eval_tp:         fiftyone.core.fields.IntField\n",
      "    eval_fp:         fiftyone.core.fields.IntField\n",
      "    eval_fn:         fiftyone.core.fields.IntField\n",
      "    abstractness:    fiftyone.core.fields.FloatField\n",
      "    new_const_field: fiftyone.core.fields.IntField\n",
      "    computed_field:  fiftyone.core.fields.IntField\n",
      "View stages:\n",
      "    1. Match(filter={'$expr': {'$gt': [...]}})\n",
      "values:  [0.8175834390151201, 1.0, 0.922046961894074, 0.799848556973409, 0.7806850524560267, 0.7950646615140298, 0.7505336395700778, 0.7530639609974709]\n"
     ]
    }
   ],
   "source": [
    "unique_thresh = 0.75\n",
    "unique_view = ds.match(F(\"uniqueness\") > unique_thresh)\n",
    "print(unique_view)\n",
    "print(\"values: \", unique_view.values(\"uniqueness\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0aSqvBle_8uL"
   },
   "source": [
    "However, in FiftyOne, given the potentially nested structure of the data in a `Dataset`, we can perform far more complex filtering operations using the same machinery, combined with the [filter()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html#fiftyone.core.expressions.ViewExpression.filter) method. Crucially, these matching and filtering operations apply equally well to embedded fields. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "40p_bnU-AkPf"
   },
   "source": [
    "As an example, let's say we want to select all images in our dataset that have at least one predicted object with very high confidence (above 0.995). In this case, the confidence score is an embedded field within the predicted detections for each image. Thus, we can create a filter on confidence scores, and then apply this filter to the embedded `detections` field within `predictions`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "id": "Z8nlu7ySpNuY"
   },
   "outputs": [],
   "source": [
    "high_conf_filter = F(\"confidence\") > 0.995\n",
    "\n",
    "high_conf_view = ds.match(\n",
    "    F(\"predictions.detections\").filter(high_conf_filter).length() > 0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Qc6u6eYtZUbW",
    "outputId": "88850b79-c376-4281-f4a9-8686380c2dd4"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset:     quickstart\n",
       "Media type:  image\n",
       "Num samples: 116\n",
       "Sample fields:\n",
       "    id:              fiftyone.core.fields.ObjectIdField\n",
       "    filepath:        fiftyone.core.fields.StringField\n",
       "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
       "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
       "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    uniqueness:      fiftyone.core.fields.FloatField\n",
       "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    eval_tp:         fiftyone.core.fields.IntField\n",
       "    eval_fp:         fiftyone.core.fields.IntField\n",
       "    eval_fn:         fiftyone.core.fields.IntField\n",
       "    abstractness:    fiftyone.core.fields.FloatField\n",
       "    new_const_field: fiftyone.core.fields.IntField\n",
       "    computed_field:  fiftyone.core.fields.IntField\n",
       "View stages:\n",
       "    1. Match(filter={'$expr': {'$gt': [...]}})"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "high_conf_view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ss7r0OBD1qRR"
   },
   "source": [
    "For video tasks, the method [match_frames()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.match_frames) allows one to perform filtering on the frames of a video collection."
   ]
  },
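  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of frame-level filtering, the snippet below assumes the `quickstart-video` zoo dataset, whose frames are assumed to store objects in a `detections` field. The dataset, the variable names, and the threshold of 10 are illustrative choices rather than part of this tutorial's data:\n",
    "\n",
    "```python\n",
    "import fiftyone.zoo as foz\n",
    "from fiftyone import ViewField as F\n",
    "\n",
    "# Assumption: a video dataset with frame-level `detections` labels\n",
    "video_ds = foz.load_zoo_dataset(\"quickstart-video\")\n",
    "\n",
    "# Keep only the frames that contain more than 10 objects\n",
    "num_objects = F(\"detections.detections\").length()\n",
    "crowded_frames_view = video_ds.match_frames(num_objects > 10)\n",
    "```\n",
    "\n",
    "Note that the expression is written relative to each frame, so the field path does not need a `frames.` prefix here."
   ]
  },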
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "z4FKrX4EB1J_"
   },
   "source": [
    "We explore this filtering and matching machinery a little more in the section on expressions, but a comprehensive discussion will be the subject of an upcoming tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tWQ60utnsKhy"
   },
   "source": [
    "### Sorting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "v3goJCxCs8W_"
   },
   "source": [
    "We might also want to sort by certain properties. Let's see how that is done in pandas and FiftyOne."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Qqg6HqC5tJoF"
   },
   "source": [
    "In pandas, we use the `sort_values` method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1zonuaaDtE9o"
   },
   "source": [
    "Suppose that we want to sort by petal length, from longest to shortest. We can do this as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "id": "eAXwmCUdtamv"
   },
   "outputs": [],
   "source": [
    "petal_length_view = df.sort_values(by=\"petal_length\", ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "ckRLUq-Atgfe",
    "outputId": "f400b28e-4eaa-4f84-afaf-65a83cc02cc9"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>118</th>\n",
       "      <td>7.7</td>\n",
       "      <td>2.6</td>\n",
       "      <td>6.9</td>\n",
       "      <td>2.3</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>122</th>\n",
       "      <td>7.7</td>\n",
       "      <td>2.8</td>\n",
       "      <td>6.7</td>\n",
       "      <td>2.0</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>117</th>\n",
       "      <td>7.7</td>\n",
       "      <td>3.8</td>\n",
       "      <td>6.7</td>\n",
       "      <td>2.2</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>105</th>\n",
       "      <td>7.6</td>\n",
       "      <td>3.0</td>\n",
       "      <td>6.6</td>\n",
       "      <td>2.1</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>131</th>\n",
       "      <td>7.9</td>\n",
       "      <td>3.8</td>\n",
       "      <td>6.4</td>\n",
       "      <td>2.0</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sepal_length  sepal_width  petal_length  petal_width    species\n",
       "118           7.7          2.6           6.9          2.3  virginica\n",
       "122           7.7          2.8           6.7          2.0  virginica\n",
       "117           7.7          3.8           6.7          2.2  virginica\n",
       "105           7.6          3.0           6.6          2.1  virginica\n",
       "131           7.9          3.8           6.4          2.0  virginica"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "petal_length_view.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mM7_UB-Uthhx"
   },
   "source": [
    "In FiftyOne, we use the [sort_by()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.sort_by) method. Let's sort the samples by the number of \"ground truth\" objects in the sample images, in descending order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "id": "OeJFLk6_vMRY"
   },
   "outputs": [],
   "source": [
    "field = \"ground_truth.detections\"\n",
    "view = ds.sort_by(F(field).length(), reverse=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "b80j_2YyvyBl",
    "outputId": "acfee42d-ba37-4f08-e3a4-11a017b600d1"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "39\n",
      "0\n"
     ]
    }
   ],
   "source": [
    "print(len(view.first().ground_truth.detections))  # 39\n",
    "print(len(view.last().ground_truth.detections))  # 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WpPazPxBv1x6"
   },
   "source": [
    "Now we can see that the most crowded image has $39$ objects, while the least crowded image is actually empty!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1RYikAZARPMw"
   },
   "source": [
    "### Deleting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sHqf3_gdRVP5"
   },
   "source": [
    "If we are resource-constrained, we can delete old `DataFrame` or `Dataset` objects so that they no longer occupy memory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KYLHuDqeRid1"
   },
   "source": [
    "In pandas, we do this using Python's `del` statement and the `gc` (garbage collector) module. To delete the `petal_length_view` view, we can do the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "BepjhRfFR_LE",
    "outputId": "daa0032d-8af3-4486-f10d-69ba549eeaa0"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "16"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import gc\n",
    "del petal_length_view\n",
    "gc.collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bFHWiRD9SYT0"
   },
   "source": [
    "In FiftyOne, we can use the dataset's built-in [delete()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.delete) method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "id": "PH4BefvMSdW6"
   },
   "outputs": [],
   "source": [
    "copy_ds.delete()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GhRFsLk9WpJP"
   },
   "source": [
    "It is also worth mentioning that in FiftyOne, a `Dataset` is non-persistent by default, so it is best thought of as an in-memory object. This means that a `Dataset` is deleted after closing Python (this is true in both Python interpreters and notebooks). If you want to use the dataset in the future, you can avoid this end-of-session deletion by setting the `persistent` property to `True`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "id": "3Qvxp_fvXtpN"
   },
   "outputs": [],
   "source": [
    "ds.persistent = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "p6JbX0A3v_Cq"
   },
   "source": [
    "## Aggregations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Gp3u0HjLAyBq"
   },
   "source": [
    "Given a set of values for a column or field, we often want to compute aggregate quantities over all of these values. pandas `DataFrame` objects and FiftyOne `Dataset` objects both come with this functionality built in."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tpnAbQ7qBbsx"
   },
   "source": [
    "In pandas, aggregations are methods of `pd.Series` objects, which represent the columns of a `DataFrame`. In FiftyOne, aggregations are methods of the `Dataset` or `DatasetView` object, and they take the field to aggregate over as *input*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4nf4TcEBB73G"
   },
   "source": [
    "### Count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "H1VGDuqaCo6p"
   },
   "source": [
    "In both pandas and FiftyOne, the [count()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.count) method returns the number of non-null values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Hn59oBPvCilr"
   },
   "source": [
    "In pandas, this counts the non-null values in the column. The Iris data has no missing values, so the count equals the number of rows in the `DataFrame`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "EXNLP3HsC4o2",
    "outputId": "2c2b56ba-3978-45d4-b543-091afcb54403"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "150\n",
      "150\n"
     ]
    }
   ],
   "source": [
    "print(df['species'].count())\n",
    "print(len(df))"
   ]
  },
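  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two numbers agree only because no species value is missing. A minimal sketch on a made-up `pd.Series` (`s` is a hypothetical name) shows that `count()` skips missing values while `len()` does not:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical series with one missing value\n",
    "s = pd.Series([1.0, None, 3.0])\n",
    "\n",
    "print(s.count())  # 2, missing values are not counted\n",
    "print(len(s))     # 3, every row is counted\n",
    "```"
   ]
  },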
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dWVz8VdYCMSL"
   },
   "source": [
    "In FiftyOne, the `count` method returns the total number of values of a given field, which is *not* necessarily the same as the number of samples. Below, we compare the number of samples with the number of predicted detections across all samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "xGfxXsMcCCLE",
    "outputId": "f68d26bc-5e7b-4317-8bd3-4af3465bd122"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "200\n",
      "5620\n"
     ]
    }
   ],
   "source": [
    "num_predictions = ds.count(\"predictions.detections.label\")\n",
    "print(len(ds))\n",
    "print(num_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dsb7SdV4HtYf"
   },
   "source": [
    "### Sum"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VdrNwe_QHu1K"
   },
   "source": [
    "Both pandas and FiftyOne provide a [sum()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.sum) method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "36BiaYfWH09w",
    "outputId": "2fc60036-1c2c-406c-a8e3-9805aa1db8f6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "876.5\n"
     ]
    }
   ],
   "source": [
    "sum_sepal_lengths = df.sepal_length.sum()\n",
    "print(sum_sepal_lengths)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "POLpmswxH9uQ",
    "outputId": "f5e5805b-ce2d-4f65-97cc-384b9cc0e30c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1966.6705134399235\n"
     ]
    }
   ],
   "source": [
    "sum_pred_confs = ds.sum(\"predictions.detections.confidence\")\n",
    "print(sum_pred_confs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NeGVcN02CGhE"
   },
   "source": [
    "### Unique"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1paxI5X7DS_J"
   },
   "source": [
    "In pandas, the `unique` method returns an array of all unique values in the input `pd.Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "kumsTSUKDk94",
    "outputId": "7b06c43c-74bd-4aaf-cf6d-d1bcd149260c"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['setosa', 'versicolor', 'virginica'], dtype=object)"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.species.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "m7b7BRI_DooA"
   },
   "source": [
    "In FiftyOne, the [distinct()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.distinct) method provides the same functionality, returning the unique values as a sorted list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "mg6iDg9lDwOc",
    "outputId": "a41b8729-4452-4610-dfbe-d78b4c133cad"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['banana',\n",
       " 'bed',\n",
       " 'bench',\n",
       " 'bicycle',\n",
       " 'bird',\n",
       " 'boat',\n",
       " 'book',\n",
       " 'bowl',\n",
       " 'broccoli',\n",
       " 'bus',\n",
       " 'cake',\n",
       " 'car',\n",
       " 'carrot',\n",
       " 'cat',\n",
       " 'cell phone',\n",
       " 'chair',\n",
       " 'clock',\n",
       " 'couch',\n",
       " 'cow',\n",
       " 'cup',\n",
       " 'dining table',\n",
       " 'dog',\n",
       " 'elephant',\n",
       " 'fire hydrant',\n",
       " 'fork',\n",
       " 'frisbee',\n",
       " 'giraffe',\n",
       " 'handbag',\n",
       " 'horse',\n",
       " 'keyboard',\n",
       " 'kite',\n",
       " 'knife',\n",
       " 'laptop',\n",
       " 'person',\n",
       " 'pizza',\n",
       " 'sandwich',\n",
       " 'scissors',\n",
       " 'sheep',\n",
       " 'skateboard',\n",
       " 'skis',\n",
       " 'snowboard',\n",
       " 'spoon',\n",
       " 'sports ball',\n",
       " 'stop sign',\n",
       " 'surfboard',\n",
       " 'tie',\n",
       " 'traffic light',\n",
       " 'train',\n",
       " 'truck',\n",
       " 'tv',\n",
       " 'umbrella']"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rand_samples_ds.distinct(\"predictions.detections.label\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "73Qtpw64DzlQ"
   },
   "source": [
    "### Bounds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LDlRv8ASEYX_"
   },
   "source": [
    "In pandas, you compute the minimum and maximum value of a `pd.Series` separately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "whcx2zt-EbLQ",
    "outputId": "43dd3abc-c633-4e9c-ad21-e1b742cb1bce"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "min_sepal_len: 4.3, max_sepal_len: 7.9\n"
     ]
    }
   ],
   "source": [
    "min_sepal_len = df.sepal_length.min()\n",
    "max_sepal_len = df.sepal_length.max()\n",
    "print(\"min_sepal_len: {}, max_sepal_len: {}\".format(min_sepal_len, max_sepal_len))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gtlrVc2IEdiL"
   },
   "source": [
    "When working with a FiftyOne `Dataset` or `DatasetView`, the min and max are returned together in a tuple when the [bounds()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.bounds) method is called on a field:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "mbm3C-MAEkMe",
    "outputId": "a67463b6-df45-4986-cd70-33ffe6ad3213"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "min_pred_conf: 0.05003104358911514, max_pred_conf: 0.9999035596847534\n"
     ]
    }
   ],
   "source": [
    "(min_pred_conf, max_pred_conf) = ds.bounds(\"predictions.detections.confidence\")\n",
    "print(\"min_pred_conf: {}, max_pred_conf: {}\".format(min_pred_conf, max_pred_conf))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EOp1T0CBEm_U"
   },
   "source": [
    "### Mean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BNBG_zcDEw7l"
   },
   "source": [
    "Both pandas and FiftyOne provide a [mean()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.mean) method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "g_oDsBmfEwzh",
    "outputId": "0dfa727a-bbad-412c-8e01-9a7e4ed3d8d0"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.843333333333334\n"
     ]
    }
   ],
   "source": [
    "mean_sepal_len = df.sepal_length.mean()\n",
    "print(mean_sepal_len)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "diC0Vv5tEws9",
    "outputId": "6997418b-6aec-4b2c-fa7b-9d328cbcbc10"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.34994137249820706\n"
     ]
    }
   ],
   "source": [
    "mean_pred_conf = ds.mean(\"predictions.detections.confidence\")\n",
    "print(mean_pred_conf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7IgdFVxyEwl3"
   },
   "source": [
    "### Standard deviation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ou8OIv-yEwfa"
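   },
   "source": [
    "Both pandas and FiftyOne provide a [std()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.std) method. Keep in mind that the defaults differ: pandas computes the sample standard deviation (`ddof=1`), whereas FiftyOne computes the population standard deviation unless `sample=True` is passed:"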
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "mY5Ohu6DEwYg",
    "outputId": "e6aad6ec-6794-4cf6-93a3-090add1a3425"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.828066127977863\n"
     ]
    }
   ],
   "source": [
    "std_sepal_len = df.sepal_length.std()\n",
    "print(std_sepal_len)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "1ZWViOMiEwSO",
    "outputId": "41abc96b-75aa-4b67-fc3e-76c246ddee0e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.3184061813934825\n"
     ]
    }
   ],
   "source": [
    "std_pred_conf = ds.std(\"predictions.detections.confidence\")\n",
    "print(std_pred_conf)"
   ]
  },
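  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When comparing standard deviations across the two libraries, it helps to know which normalization each one uses. pandas `std()` defaults to the sample standard deviation (`ddof=1`); to our understanding, FiftyOne's `std()` defaults to the population standard deviation and accepts `sample=True` to switch. A minimal pandas-only sketch on made-up values (`s` is a hypothetical name) shows how the two conventions differ:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical values, used only for illustration\n",
    "s = pd.Series([1.0, 2.0, 3.0, 4.0])\n",
    "\n",
    "sample_std = s.std()            # divides by n - 1 (pandas default)\n",
    "population_std = s.std(ddof=0)  # divides by n\n",
    "\n",
    "# The sample estimate is always the larger of the two\n",
    "assert sample_std > population_std\n",
    "```"
   ]
  },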
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8jRe7jOLEwML"
   },
   "source": [
    "### Quantiles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5Dsdx6aBEwGE"
   },
   "source": [
    "If you don't want just the mean, but instead want the value for a given column or field at arbitrary quantiles of the data, pandas provides the `quantile()` method and FiftyOne provides the [quantiles()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.quantiles) method. Both take a list of quantiles expressed as fractions between 0 and 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "id": "RedXYHr_Ev-j"
   },
   "outputs": [],
   "source": [
    "percentiles = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "AxLdNiYTGHHA",
    "outputId": "d62f1b0b-1ed1-4236-f1da-8dbf985581ba"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.0    4.30\n",
      "0.2    5.00\n",
      "0.4    5.60\n",
      "0.6    6.10\n",
      "0.8    6.52\n",
      "1.0    7.90\n",
      "Name: sepal_length, dtype: float64\n"
     ]
    }
    ],
   "source": [
    "sepal_len_quantiles = df.sepal_length.quantile(percentiles)\n",
    "print(sepal_len_quantiles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Gd7Y6AAlGHWc",
    "outputId": "8a3b7f82-e588-4189-fb09-9af3fd17d6f9"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.05003104358911514, 0.08101843297481537, 0.14457139372825623, 0.2922309935092926, 0.6890143156051636, 0.9999035596847534]\n"
     ]
    }
   ],
   "source": [
    "pred_conf_quantiles = ds.quantiles(\"predictions.detections.confidence\", percentiles)\n",
    "print(pred_conf_quantiles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9vbI9akCGX0H"
   },
   "source": [
    "### Median and other aggregations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9cT1xl2fGX8s"
   },
   "source": [
    "Some aggregations that are native to pandas, such as computing the median, are not native to FiftyOne. In these cases, the canonical way to compute the aggregation is to first extract the values from the `Dataset` field, and then use native numpy or scipy functionality.\n",
    "\n",
    "Here we illustrate this procedure for computing the median. If we use the `values` method on the `predictions.detections.confidence` field with default arguments, we get a jagged list of lists, with one inner list of confidences per sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "_KtxAwT2YUAA",
    "outputId": "0b7cb903-595c-47a2-a3bb-67a9dbc1a80c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[14, 20, 10, 51, 27, 13, 2, 9, 7, 13]\n",
      "5620\n"
     ]
    }
   ],
   "source": [
    "pred_confs_jagged = ds.values(\"predictions.detections.confidence\")\n",
    "print([len(pc) for pc in pred_confs_jagged][:10])\n",
    "print(sum([len(pc) for pc in pred_confs_jagged]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8pED-KTzZcge"
   },
   "source": [
    "However, we can simplify our lives by flattening the result, which we do by passing in the argument `unwind=True`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "-P96LZ6KGYEj",
    "outputId": "da37def1-9cb1-4ea5-e637-81f83cbce614"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5620\n"
     ]
    }
    ],
   "source": [
    "pred_confs_flat = ds.values(\"predictions.detections.confidence\", unwind=True)\n",
    "print(len(pred_confs_flat))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wXUhAATsnFF8"
   },
   "source": [
    "And from this we can easily compute the median:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "hhEhVTTYnFgG",
    "outputId": "608e6f5a-85b5-4e1f-c476-cd363606b232"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.20251326262950897\n"
     ]
    }
   ],
   "source": [
    "pred_confs_median = np.median(pred_confs_flat)\n",
    "print(pred_confs_median)"
   ]
  },
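  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, unwinding amounts to flattening a list of lists before aggregating. The sketch below uses made-up confidence values (`jagged` and `flat` are hypothetical names, not taken from the dataset) and only the Python standard library:\n",
    "\n",
    "```python\n",
    "from itertools import chain\n",
    "from statistics import median\n",
    "\n",
    "# One inner list per sample; a sample may have no detections at all\n",
    "jagged = [[0.9, 0.2], [], [0.5]]\n",
    "\n",
    "# Flatten into a single list of confidences\n",
    "flat = list(chain.from_iterable(jagged))\n",
    "\n",
    "print(flat)          # [0.9, 0.2, 0.5]\n",
    "print(median(flat))  # 0.5\n",
    "```"
   ]
  },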
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xWxDFFgmGYMO"
   },
   "source": [
    "## Structural change operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ffd76UxqLu29"
   },
   "source": [
    "### Add new column/field"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oq6ffA1UoTNO"
   },
   "source": [
    "There are many scenarios in which one might want to add another column/field to a dataset. From a practical standpoint, these come in three distinct flavors:\n",
    "1. Add a new column/field with a default (constant) value for each row/sample.\n",
    "2. Add a new column/field defined by external or precomputed data.\n",
    "3. Create a new column/field programmatically from other columns/fields.\n",
    "\n",
    "In this section we show how to efficiently handle each of these cases in pandas and FiftyOne."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "I70xfztGpZvV"
   },
   "source": [
    "#### Add new column/field with default value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iFG6DWrJp1Kn"
   },
   "source": [
    "In pandas, the easiest way to create a new column `const_col` with constant value `const_val` is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "fts-XK0zqaxY",
    "outputId": "0c570248-d320-40ae-b1f5-7203cf8fe9ff"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "      <th>const_col</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species  const_col\n",
       "0           5.1          3.5           1.4          0.2  setosa  const_val\n",
       "1           4.9          3.0           1.4          0.2  setosa  const_val\n",
       "2           4.7          3.2           1.3          0.2  setosa  const_val\n",
       "3           4.6          3.1           1.5          0.2  setosa  const_val\n",
       "4           5.0          3.6           1.4          0.2  setosa  const_val"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['const_col'] = 'const_val'\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3IK6dNrAqj5X"
   },
   "source": [
    "which implicitly broadcasts the single value `const_val` to all rows in the `DataFrame`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4JKsiyhFtjk6"
   },
   "source": [
    "In FiftyOne, the canonical process for efficiently creating and populating a new field involves three steps: (1) add the new field to the `Dataset` with the [add_sample_field()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.add_sample_field) method, called as `add_sample_field(field_name, ftype)`; (2) populate the field using either [set_field()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.set_field) or [set_values()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.set_values), as we will illustrate below; and (3) persist the changes by calling [save()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.save) on the `Dataset` or `DatasetView`.\n",
    "\n",
    "There is one key distinction in usage between `set_field` and `set_values`. Whereas `set_values` sets the values on the `Dataset` directly, using `set_field` creates a new `DatasetView`, and this `DatasetView` is what must be saved!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before illustrating these more efficient approaches, it is worth mentioning that you can also loop through the samples in a `Dataset` or `DatasetView` and add or set fields one at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [],
   "source": [
    "for sample in ds.iter_samples(autosave=True):\n",
    "    sample[\"new_const_field\"] = 51\n",
    "    sample[\"computed_field\"] = len(sample.ground_truth.detections)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, this is *not* an efficient approach. It is recommended to use `set_field` or `set_values` instead."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BpXSzyWvvExd"
   },
   "source": [
    "In the simplest scenario, analogous to the pandas example above, we can pass a single value into `set_field` along with the name of the field:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "xY87L19nrYJ_",
    "outputId": "25c0cf75-68b0-4170-97ce-2232e93ceefc"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field', 'const_field')\n",
      "['const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val', 'const_val']\n"
     ]
    }
   ],
   "source": [
    "ds.add_sample_field(\"const_field\", fo.StringField)\n",
    "view = ds.set_field(\"const_field\", \"const_val\")\n",
    "view.save()\n",
    "\n",
    "print(ds.first().field_names)\n",
    "print(ds.values(\"const_field\")[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8GrBn6wgr0WU"
   },
   "source": [
    "As we will see shortly, however, `set_field` is far more flexible and powerful than this, as a result of FiftyOne's robust matching and filtering capabilities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HtcSxN2Lsb5T"
   },
   "source": [
    "#### Add new column/field from external data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6bLTpqT9soJU"
   },
   "source": [
    "Starting with pandas, suppose our data team tells us that they now also have the stem length for each flower, and that they want us to incorporate that data into our models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_ivyQY8ZsroV"
   },
   "source": [
    "For instance, let's say the stem lengths are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "id": "ld_UPFj4L12v"
   },
   "outputs": [],
   "source": [
    "stem_lengths = np.random.uniform(5, 10, len(df))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_HUXG5qRMJfi"
   },
   "source": [
    "We can add this to our dataset using syntax similar to the above. The only difference is that this time the assignment takes an array (here, a NumPy array) instead of a single value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "id": "La9t0QvHJiew"
   },
   "outputs": [],
   "source": [
    "df['stem_length'] = stem_lengths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "Ig5bIifxJiYf",
    "outputId": "b30b1b25-d915-459c-ce52-41a94079a433"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "      <th>const_col</th>\n",
       "      <th>stem_length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>9.519895</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>9.230470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>8.312255</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>6.762648</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>8.624046</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species  const_col  \\\n",
       "0           5.1          3.5           1.4          0.2  setosa  const_val   \n",
       "1           4.9          3.0           1.4          0.2  setosa  const_val   \n",
       "2           4.7          3.2           1.3          0.2  setosa  const_val   \n",
       "3           4.6          3.1           1.5          0.2  setosa  const_val   \n",
       "4           5.0          3.6           1.4          0.2  setosa  const_val   \n",
       "\n",
       "   stem_length  \n",
       "0     9.519895  \n",
       "1     9.230470  \n",
       "2     8.312255  \n",
       "3     6.762648  \n",
       "4     8.624046  "
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
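  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a side note on the array assignment above, here is a minimal sketch of how pandas matches the new values to rows. The `toy` DataFrame and its numbers are made up for illustration and are not part of the iris data used in this tutorial.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for the iris DataFrame (values are made up)\n",
    "toy = pd.DataFrame({'sepal_length': [5.1, 4.9, 4.7]})\n",
    "\n",
    "# A NumPy array is assigned by position, so its length must match\n",
    "toy['stem_length'] = np.array([9.5, 9.2, 8.3])\n",
    "\n",
    "# An array of the wrong length is rejected with a ValueError\n",
    "try:\n",
    "    toy['too_short'] = np.array([1.0, 2.0])\n",
    "    length_error = None\n",
    "except ValueError as err:\n",
    "    length_error = err\n",
    "\n",
    "# A Series is aligned on the index instead; unmatched rows become NaN\n",
    "toy['partial'] = pd.Series([7.0], index=[1])\n",
    "print(toy)\n",
    "```\n",
    "\n",
    "In short, plain arrays need one value per row, while a `Series` can cover only some of the rows."
   ]
  },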
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3DckxLGVxd4R"
   },
   "source": [
    "In FiftyOne, we can do something similar by passing an array of values into `set_values`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wMht819axeM4"
   },
   "source": [
    "As an example, let's say we have an `abstractness` score between zero and one for each image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "id": "ZTeGf6ShxeYQ"
   },
   "outputs": [],
   "source": [
    "abstractness = np.random.uniform(0, 1, len(ds))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "9ZhTl77jxejV",
    "outputId": "67d48d79-d4d3-4dbf-9725-01587d243b7b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field', 'const_field')\n",
      "[0.18992196548662132, 0.4195423356383746, 0.9782249923275138, 0.3555547463728417, 0.9019379850096877, 0.3647814428112852, 0.3030278060870243, 0.241988161650587, 0.7872455674533378, 0.44774858997738953]\n"
     ]
    }
   ],
   "source": [
    "ds.set_values(\"abstractness\", abstractness)\n",
    "print(ds.first().field_names)\n",
    "print(ds.values(\"abstractness\")[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5x2aeW-b3Qlz"
   },
   "source": [
    "Note that when using `set_values` we are modifying the `Dataset` directly. Thus, as opposed to `set_field`, we do not need to preface the method call with `add_sample_field`, and we do not need to explicitly save the `Dataset` with `save` afterwards."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HIEju-rhxevP"
   },
   "source": [
    "#### Add a new column/field from existing columns/fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "STE_fzSHz2XV"
   },
   "source": [
    "Finally, in the course of feature engineering or data analysis, you will often want to generate new columns or fields from existing ones."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5o1-XV140PXC"
   },
   "source": [
    "In pandas, a common way of doing this is with the `apply` method. Suppose we want to create a new feature called \"sepal volume\", defined as the product of sepal length and sepal width. With `apply` and `axis=1`, we can map a function over the rows of the `DataFrame`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "id": "iRYsiqmy0duu"
   },
   "outputs": [],
   "source": [
    "df[\"sepal_volume\"] = df.apply(lambda x: x[\"sepal_length\"]*x[\"sepal_width\"], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "Asd7ERVh0djt",
    "outputId": "b1890752-ed23-445c-cc7d-67ccc3e81a96"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "      <th>const_col</th>\n",
       "      <th>stem_length</th>\n",
       "      <th>sepal_volume</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>9.519895</td>\n",
       "      <td>17.85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>9.230470</td>\n",
       "      <td>14.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>8.312255</td>\n",
       "      <td>15.04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>6.762648</td>\n",
       "      <td>14.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>8.624046</td>\n",
       "      <td>18.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species  const_col  \\\n",
       "0           5.1          3.5           1.4          0.2  setosa  const_val   \n",
       "1           4.9          3.0           1.4          0.2  setosa  const_val   \n",
       "2           4.7          3.2           1.3          0.2  setosa  const_val   \n",
       "3           4.6          3.1           1.5          0.2  setosa  const_val   \n",
       "4           5.0          3.6           1.4          0.2  setosa  const_val   \n",
       "\n",
       "   stem_length  sepal_volume  \n",
       "0     9.519895         17.85  \n",
       "1     9.230470         14.70  \n",
       "2     8.312255         15.04  \n",
       "3     6.762648         14.26  \n",
       "4     8.624046         18.00  "
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
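  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For simple arithmetic like this, the same column can also be computed without a row-wise function. The sketch below compares the two approaches on a small made-up `toy` DataFrame (a hypothetical stand-in, not the iris data loaded earlier).\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for the iris DataFrame (values are made up)\n",
    "toy = pd.DataFrame({\n",
    "    'sepal_length': [5.1, 4.9, 4.7],\n",
    "    'sepal_width': [3.5, 3.0, 3.2],\n",
    "})\n",
    "\n",
    "# Row-wise: the lambda is called once per row\n",
    "via_apply = toy.apply(lambda x: x['sepal_length'] * x['sepal_width'], axis=1)\n",
    "\n",
    "# Column-wise: a single vectorized multiplication of two columns\n",
    "vectorized = toy['sepal_length'] * toy['sepal_width']\n",
    "\n",
    "# Check that both approaches agree on every row\n",
    "print((via_apply == vectorized).all())\n",
    "```\n",
    "\n",
    "The column-wise form is typically faster on large tables, while `apply` remains useful when the per-row logic cannot be expressed as column arithmetic."
   ]
  },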
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Wg1kRt1D0dSy"
   },
   "source": [
    "In FiftyOne, we can perform operations like this by combining `set_field` with the `ViewField`, here loaded as `F`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_JUuT-8H1b_l"
   },
   "source": [
    "To compute the number of predicted object detections for each sample in the `Dataset` we can write:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "e_iOqbPhJiMN",
    "outputId": "ee425231-2516-4fec-e0e1-20cf1af6da89"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('detections', 'num_predictions')\n",
      "[14, 20, 10, 51, 27, 13, 2, 9, 7, 13]\n"
     ]
    }
   ],
   "source": [
    "view = ds.set_field(\n",
    "    \"predictions.num_predictions\",\n",
    "    F(\"$predictions.detections\").length(),\n",
    ")\n",
    "view.save()\n",
    "print(ds.first().predictions.field_names)\n",
    "print(ds.values(\"predictions.num_predictions\")[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AxY5QqRJ2IYf"
   },
   "source": [
    "The above also highlights that all of the aforementioned operations work on embedded fields. Note, however, that because we are not changing the base field schema, we do not need to call `add_sample_field`!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XAQK39EQGBOc"
   },
   "source": [
    "### Remove a column/field"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AwAxBX5gGBVV"
   },
   "source": [
    "Sometimes you want to look at a dataset *without* a certain column/field. More precisely, there are two related things you might want to do:\n",
    "1. Create a new view of the dataset without a specific column/field, or\n",
    "2. Delete a specific column/field from the original dataset.\n",
    "\n",
    "Here, we show how to do both of these in pandas and FiftyOne."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OfSmD3C-GBdF"
   },
   "source": [
    "In pandas, you can create a new `DataFrame` without specific columns using the `drop` method, which leaves the original `DataFrame` untouched:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "fS-g_SpdHNQe",
    "outputId": "2ce236b7-4e35-4ee6-8d0e-7c35e91612d4"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "      <th>const_col</th>\n",
       "      <th>stem_length</th>\n",
       "      <th>sepal_volume</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>9.519895</td>\n",
       "      <td>17.85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>9.230470</td>\n",
       "      <td>14.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>8.312255</td>\n",
       "      <td>15.04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>6.762648</td>\n",
       "      <td>14.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>const_val</td>\n",
       "      <td>8.624046</td>\n",
       "      <td>18.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species  const_col  \\\n",
       "0           5.1          3.5           1.4          0.2  setosa  const_val   \n",
       "1           4.9          3.0           1.4          0.2  setosa  const_val   \n",
       "2           4.7          3.2           1.3          0.2  setosa  const_val   \n",
       "3           4.6          3.1           1.5          0.2  setosa  const_val   \n",
       "4           5.0          3.6           1.4          0.2  setosa  const_val   \n",
       "\n",
       "   stem_length  sepal_volume  \n",
       "0     9.519895         17.85  \n",
       "1     9.230470         14.70  \n",
       "2     8.312255         15.04  \n",
       "3     6.762648         14.26  \n",
       "4     8.624046         18.00  "
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "jjB7oezdHNJQ",
    "outputId": "d9c85d0f-a88b-42f9-8d4b-88acfc708f8f"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "      <th>stem_length</th>\n",
       "      <th>sepal_volume</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>9.519895</td>\n",
       "      <td>17.85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>9.230470</td>\n",
       "      <td>14.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>8.312255</td>\n",
       "      <td>15.04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>6.762648</td>\n",
       "      <td>14.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>8.624046</td>\n",
       "      <td>18.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species  stem_length  \\\n",
       "0           5.1          3.5           1.4          0.2  setosa     9.519895   \n",
       "1           4.9          3.0           1.4          0.2  setosa     9.230470   \n",
       "2           4.7          3.2           1.3          0.2  setosa     8.312255   \n",
       "3           4.6          3.1           1.5          0.2  setosa     6.762648   \n",
       "4           5.0          3.6           1.4          0.2  setosa     8.624046   \n",
       "\n",
       "   sepal_volume  \n",
       "0         17.85  \n",
       "1         14.70  \n",
       "2         15.04  \n",
       "3         14.26  \n",
       "4         18.00  "
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "no_const_view = df.drop([\"const_col\"], axis=1)\n",
    "# equivalent to df.drop(columns=[\"const_col\"])\n",
    "\n",
    "no_const_view.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zbRIioNIQPwW"
   },
   "source": [
    "To delete the column from the original `DataFrame`, reassign the variable holding the original `DataFrame` to the result of `drop`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "aR5HXNPUQXb1",
    "outputId": "3f9acc0a-c83a-4e82-bf41-4f625050382f"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "      <th>stem_length</th>\n",
       "      <th>sepal_volume</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>9.519895</td>\n",
       "      <td>17.85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>9.230470</td>\n",
       "      <td>14.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>8.312255</td>\n",
       "      <td>15.04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>6.762648</td>\n",
       "      <td>14.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>8.624046</td>\n",
       "      <td>18.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species  stem_length  \\\n",
       "0           5.1          3.5           1.4          0.2  setosa     9.519895   \n",
       "1           4.9          3.0           1.4          0.2  setosa     9.230470   \n",
       "2           4.7          3.2           1.3          0.2  setosa     8.312255   \n",
       "3           4.6          3.1           1.5          0.2  setosa     6.762648   \n",
       "4           5.0          3.6           1.4          0.2  setosa     8.624046   \n",
       "\n",
       "   sepal_volume  \n",
       "0         17.85  \n",
       "1         14.70  \n",
       "2         15.04  \n",
       "3         14.26  \n",
       "4         18.00  "
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = df.drop([\"const_col\"], axis=1)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QuHfZLXSRBYk"
   },
   "source": [
    "In FiftyOne, you can create a `ViewStage` without a particular field using the [exclude_fields()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.exclude_fields) method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Djv8TrsZRRI1",
    "outputId": "c12aaef2-7bae-4b77-da8d-375fc04cd8dc"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field', 'const_field')\n"
     ]
    }
   ],
   "source": [
    "no_predictions_view = ds.exclude_fields(\"predictions\")\n",
    "print(no_predictions_view.first().field_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EcSkG3c7RfwH"
   },
   "source": [
    "Alternatively, you can delete a field from the `Dataset` using [delete_sample_field()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.delete_sample_field)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "vSxMrbU-TvVs",
    "outputId": "cd4274b5-32b9-452b-b6bf-4c05ca3d5927"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')\n"
     ]
    }
   ],
   "source": [
    "ds.delete_sample_field(\"const_field\")\n",
    "print(ds.first().field_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "L4a9papmUC-F"
   },
   "source": [
    "Both the `exclude_field` and `delete_sample_field` methods also work with embedded fields:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "mMB364TCUNc7",
    "outputId": "ac6492bc-6945-4f1c-a611-ec9389cac69c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('detections',)\n"
     ]
    }
   ],
   "source": [
    "ds.delete_sample_field(\"predictions.num_predictions\")\n",
    "print(ds.first().predictions.field_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2c_JpeMyHNCz"
   },
   "source": [
    "To delete multiple fields at once, you can use the related [delete_sample_fields()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.delete_sample_fields) method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PHs-nuOwHM3S"
   },
   "source": [
    "### Keep only specified columns/fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "50GOsE_AHMxv"
   },
   "source": [
    "Alternatively, if you only want to create a view with a small subset of columns/fields, it might be easier to specify those directly. As with removing columns, this can be done in a way that creates a new view while preserving the original, or in a way that deletes the columns/fields from the original dataset. We show both approaches below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Kq6zLgzDHMrW"
   },
   "source": [
    "In pandas, to create a new view with only the \"sepal_length\" and \"sepal_width\" columns, one could write:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "TcyYhDHnGBkb",
    "outputId": "4217d398-a3a1-4b63-fe5f-96e0e2750623"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width\n",
       "0           5.1          3.5\n",
       "1           4.9          3.0\n",
       "2           4.7          3.2\n",
       "3           4.6          3.1\n",
       "4           5.0          3.6"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sepal_df = df[[\"sepal_length\", \"sepal_width\"]]\n",
    "sepal_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "g32TIwNQWgrr"
   },
   "source": [
    "In contrast, the following propagates the changes back to the original `DataFrame`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "0joz2Xz8Wrpm",
    "outputId": "a0861863-8b1d-4e07-d349-74aeca2971b9"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length\n",
       "0           5.1\n",
       "1           4.9\n",
       "2           4.7\n",
       "3           4.6\n",
       "4           5.0"
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sepal_df = sepal_df[[\"sepal_length\"]]\n",
    "sepal_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GXgOL6H7Y0CQ"
   },
   "source": [
    "In FiftyOne, if we want to create a separate view with only specified fields kept, we should first clone the original dataset and then apply the [select_fields()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.select_fields) method. when we apply the [keep_fields()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.patches.html#fiftyone.core.patches.EvaluationPatchesView.keep_fields) method following application of `select_fields`, the changes propagate from the `DatasetView` back to the underlying `Dataset`.\n",
    "\n",
    "Let's create two clones of our base `Dataset` to showcase this distinction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "id": "NDj8th9faQXD"
   },
   "outputs": [],
   "source": [
    "ds_clone1 = ds.clone()\n",
    "ds_clone2 = ds.clone()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fCYRSQG6alS5"
   },
   "source": [
    "For both of these clones, let's create views which select only the `ground_truth` field:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "iFAlZ3UXawWN",
    "outputId": "0303dd4f-79de-45f0-d0c9-d03a0a883903"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('id', 'filepath', 'tags', 'metadata', 'ground_truth')\n",
      "('id', 'filepath', 'tags', 'metadata', 'ground_truth')\n"
     ]
    }
   ],
   "source": [
    "clone1_view = ds_clone1.select_fields(\"ground_truth\")\n",
    "clone2_view = ds_clone2.select_fields(\"ground_truth\")\n",
    "print(clone1_view.first().field_names)\n",
    "print(clone2_view.first().field_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Rv_NBv27a4Oz"
   },
   "source": [
    "The `id`, `filepath`, `tags`, and `metadata` are by default preserved, even when not passed in to `select_fields`. Aside from these and `ground_truth`, all other fields have been omitted from view. Now let's only apply `keep_fields` on the first clone, and see what changes propagate back."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "id": "QHWZjciwbfZS"
   },
   "outputs": [],
   "source": [
    "clone1_view.keep_fields()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "oWzrIzogbkch",
    "outputId": "e316568b-ae67-4edf-88c6-931ade4c6c56"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('id', 'filepath', 'tags', 'metadata', 'ground_truth')\n",
      "('id', 'filepath', 'tags', 'metadata', 'ground_truth', 'uniqueness', 'predictions', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')\n"
     ]
    }
   ],
   "source": [
    "print(ds_clone1.first().field_names)\n",
    "print(ds_clone2.first().field_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GG9mgt_LbrrS"
   },
   "source": [
    "As we can see, the changes only propagated back to the dataset (in this case `ds_clone1`) when we applied `keep_fields`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jQmnfvc1Ke-s"
   },
   "source": [
    "Finally, we note that when dealing with video datasets, the methods `exclude_fields` and `select_fields` have analogous methods for frames - [exclude_frames()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.exclude_frames) and [select_frames()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.select_frames)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kIaqBu3YJhun"
   },
   "source": [
    "### Concatenation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GSJBOnvdJhjs"
   },
   "source": [
    "Suppose we have two datasets we want to combine or concatenate. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GyyLFBRwJhXC"
   },
   "source": [
    "In both pandas and FiftyOne, we can concatenate them using the `concat` method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In pandas, we can combine two `DataFrame` objects:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ln-7C08EJhQ6",
    "outputId": "da111c3e-617d-4755-ff08-6b39f8e9ca71"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100\n"
     ]
    }
   ],
   "source": [
    "df1 = df[df.species == 'setosa']\n",
    "df2 = df[df.species == 'virginica']\n",
    "concat_df = pd.concat([df1, df2])\n",
    "print(len(concat_df))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wgzDXX5FJhKq"
   },
   "source": [
    "In FiftyOne, we can use the [concat()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.concat) method to combine views from the same dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {
    "id": "81RdB31LJhEq"
   },
   "outputs": [],
   "source": [
    "view1 = ds.match(F(\"uniqueness\") < 0.2)\n",
    "view2 = ds.match(F(\"uniqueness\") > 0.7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Cxx0qcuLJg-n",
    "outputId": "0003848d-b7e2-454c-c140-775adaa9a6cb"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "19\n",
      "17\n"
     ]
    }
   ],
   "source": [
    "print(len(view1))\n",
    "print(len(view2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "3Y9MqkeVJg45",
    "outputId": "0fd15890-ed09-4165-f432-a2ef474c7841"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "36\n",
      "36\n"
     ]
    }
   ],
   "source": [
    "concat_view = view1.concat(view2)\n",
    "print(len(view1) + len(view2))\n",
    "print(len(concat_view))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tiN_bOpn4JiX"
   },
   "source": [
    "The slightly more complicated operation of concatenating `Dataset` objects `ds1` and `ds2` (as opposed to `DatasetView` objects) can be achieved using [merge_samples()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html?highlight=merge_samples#fiftyone.core.dataset.Dataset.merge_samples), i.e., `ds1.merge_samples(ds2)`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JvGLa12tJgyu"
   },
   "source": [
    "### Adding a single row/sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kMBlQMptJgr0"
   },
   "source": [
    "Often times, we just want to enhance a dataset by adding in one sample at a time. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "61iO8umoJgdf"
   },
   "source": [
    "In pandas, the fastest way to do this is to use the same `concat` method as above. If the row data is in a dictionary format, we convert it to its own `DataFrame` first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "50"
      ]
     },
     "execution_count": 112,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "GkBEtXKRGYT8",
    "outputId": "2b1ab2c5-a0b8-47b1-df0a-dfaa23be02bb"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "51\n"
     ]
    }
   ],
   "source": [
    "single_row = df2.iloc[0]\n",
    "df1_plus = pd.concat([df1, pd.DataFrame([single_row])], axis=1)\n",
    "print(len(df1_plus))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xEjPLJl1DS0A"
   },
   "source": [
    "In FiftyOne, we can use the [add_sample()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.add_sample) method. Notice that this is an *in-place* operation, and no assignment is needed. Also note that this does not work for views - a sample can only be added to a `Dataset`, not to a `Dataview`. As such, we first clone the view to turn it into its own `Dataset`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ai0sV4jQDY7E",
    "outputId": "4619e5bb-6a4c-4b1c-9c5e-5f80e8ecdbab"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "19\n",
      "20\n"
     ]
    }
   ],
   "source": [
    "single_sample = view2.first()\n",
    "view1_plus = view1.clone()\n",
    "print(len(view1_plus))\n",
    "view1_plus.add_sample(single_sample)\n",
    "print(len(view1_plus))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Dk-Qa_xpDZFb"
   },
   "source": [
    "We can also add a collection of samples to a dataset using the [add_samples()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.add_samples) method, which takes as input a list of `fo.Sample` objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "kuAR8waRDZLo",
    "outputId": "8b1490ee-07e5-45a1-fcf2-926df49c9da4"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20\n",
      " 100% |█████████████████████| 3/3 [35.6ms elapsed, 0s remaining, 84.2 samples/s]     \n",
      "23\n"
     ]
    }
   ],
   "source": [
    "print(len(view1_plus))\n",
    "view1_plus.add_samples(view2.skip(1).head(3))\n",
    "print(len(view1_plus))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NyHWJ8lhDZRx"
   },
   "source": [
    "### Remove rows/samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_gdnHfnKd1z-"
   },
   "source": [
    "The same in-place vs out-of-place considerations for pandas, and `Dataset` vs `DatasetView` considerations for FiftyOne apply to rows/samples as applied to columns/fields."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ukDdb2H2fTjk"
   },
   "source": [
    "In pandas, rows are removed by index using the `drop` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {
    "id": "M1EsOMfDej7X"
   },
   "outputs": [],
   "source": [
    "### Randomly select a set of rows to remove\n",
    "import random\n",
    "rows_to_remove = random.sample(range(len(df)), 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KdebKli5fbZ6"
   },
   "source": [
    "To create a new view:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "yMEfOaYkfErC",
    "outputId": "59244a83-eb27-40ae-9fc5-57f76061111c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "140\n",
      "150\n"
     ]
    }
   ],
   "source": [
    "sub_df = df.drop(rows_to_remove)\n",
    "print(len(sub_df))\n",
    "print(len(df))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jMRiJ_J1DZYZ"
   },
   "source": [
    "To remove the rows from the original `DataFrame`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "MqsNFYhNDZfC",
    "outputId": "4d812ffc-9ce6-46d2-ab74-244f32862c2e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "140\n"
     ]
    }
   ],
   "source": [
    "copy_df = df.copy()\n",
    "copy_df = copy_df.drop(rows_to_remove)\n",
    "print(len(copy_df))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nNwB6H_tgCYH"
   },
   "source": [
    "In FiftyOne, [exclude()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.exclude) creates a view without the specified samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {
    "id": "nw3-UM5Ygq9n"
   },
   "outputs": [],
   "source": [
    "samples_to_remove = ds.take(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "-EXytcxbgszW",
    "outputId": "3434716b-0510-4cf4-bf23-b64e90a7dfd9"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "200\n",
      "190\n",
      "<class 'fiftyone.core.view.DatasetView'>\n"
     ]
    }
   ],
   "source": [
    "sub_view = ds.exclude(samples_to_remove)\n",
    "print(len(ds))\n",
    "print(len(sub_view))\n",
    "print(type(sub_view))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_rRxuQIZhKKm"
   },
   "source": [
    "On the other hand, [delete_samples()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.delete_samples) is an in-place operation which deletes the samples from the underlying `Dataset`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "FsdKeL0ShYcZ",
    "outputId": "772a76b9-1949-444a-d488-6d88d3ac77b3"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "190\n"
     ]
    }
   ],
   "source": [
    "sub_ds = ds.clone()\n",
    "sub_ds.delete_samples(samples_to_remove)\n",
    "print(len(sub_ds))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "S_c-vGgZiErY"
   },
   "source": [
    "### Keep only specified rows/samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "--V2eLEWiNnf"
   },
   "source": [
    "As with columns/fields, one might want to pick out specific rows/samples. In the section on filtering and expressions, we'll cover more advanced operations. Here we show how to select the data corresponding to a given list of rows/samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {
    "id": "TiSZVRXqiNZY"
   },
   "outputs": [],
   "source": [
    "rows_to_keep = list(random.sample(range(len(df)), 80))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Boh_DmIJiNN3",
    "outputId": "a0f77804-ef86-4b82-9064-882cb76ba7b8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "80\n"
     ]
    }
   ],
   "source": [
    "sub_df = df.iloc[rows_to_keep]\n",
    "print(len(sub_df))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "fg5vZXauiNG6",
    "outputId": "13b058f9-9ff5-4209-a316-6749741af2b7"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "80\n",
      "80\n"
     ]
    }
   ],
   "source": [
    "sample_ids = ds.values(\"id\")\n",
    "ids_to_keep = [sample_ids[ind] for ind in rows_to_keep]\n",
    "print(len(ids_to_keep))\n",
    "print(len(ds.select(ids_to_keep)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Wh_QYco6mNYd"
   },
   "source": [
    "### Rename column/field"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XrIwFzjjiNAU"
   },
   "source": [
    "In pandas, you can rename columns by passing a dictionary or mapping into the `rename()` method with the `columns` argument. This is *not* an in-place operation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "HLZaN5jciM5D",
    "outputId": "63047541-45fa-4c09-a980-831d8b46f731"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sl</th>\n",
       "      <th>sw</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "      <th>stem_length</th>\n",
       "      <th>sepal_volume</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>9.519895</td>\n",
       "      <td>17.85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>9.230470</td>\n",
       "      <td>14.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>8.312255</td>\n",
       "      <td>15.04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>6.762648</td>\n",
       "      <td>14.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>8.624046</td>\n",
       "      <td>18.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    sl   sw  petal_length  petal_width species  stem_length  sepal_volume\n",
       "0  5.1  3.5           1.4          0.2  setosa     9.519895         17.85\n",
       "1  4.9  3.0           1.4          0.2  setosa     9.230470         14.70\n",
       "2  4.7  3.2           1.3          0.2  setosa     8.312255         15.04\n",
       "3  4.6  3.1           1.5          0.2  setosa     6.762648         14.26\n",
       "4  5.0  3.6           1.4          0.2  setosa     8.624046         18.00"
      ]
     },
     "execution_count": 125,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "renamed_df = df.rename(columns = {\"sepal_length\": \"sl\", \"sepal_width\": \"sw\"})\n",
    "renamed_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1V0rUFNkoVAe"
   },
   "source": [
    "In FiftyOne, you can rename fields using an analogous (but in-place) name mapping, passed in to the [rename_sample_fields()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.rename_sample_fields) method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "JyzD_j5ziLlo",
    "outputId": "50092db1-53a6-4f6c-fd81-0176e470a201"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('id', 'filepath', 'tags', 'metadata', 'gt', 'uniqueness', 'pred', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')\n"
     ]
    }
   ],
   "source": [
    "renamed_ds = ds.clone()\n",
    "renamed_ds.rename_sample_fields({\"ground_truth\": \"gt\", \"predictions\":\"pred\"})\n",
    "print(renamed_ds.first().field_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TOkWh8JAorfN"
   },
   "source": [
    "Alternatively, if you just want to rename a single field, you can also do so with the [rename_sample_field()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.rename_sample_field) method as `rename_sample_field(old_field_name, new_field_name)`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "TfItIybToquc",
    "outputId": "e169c658-aa70-4364-930f-eb4c321eae1a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('id', 'filepath', 'tags', 'metadata', 'gt_new', 'uniqueness', 'pred', 'eval_tp', 'eval_fp', 'eval_fn', 'abstractness', 'new_const_field', 'computed_field')\n"
     ]
    }
   ],
   "source": [
    "renamed_ds.rename_sample_field(\"gt\", \"gt_new\")\n",
    "print(renamed_ds.first().field_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vd7n-a0HpEfk"
   },
   "source": [
    "Both of these methods extend naturally to embedded fields:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8575063187115628"
      ]
     },
     "execution_count": 128,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "renamed_ds.first().pred.detections[0].eval_iou"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "OnAz_4fzpD5b",
    "outputId": "b0c96bcf-aba5-474e-f2f1-40fdc16d434b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('id', 'attributes', 'tags', 'label', 'bounding_box', 'mask', 'confidence', 'index', 'eval', 'eval_id', 'iou')\n"
     ]
    }
   ],
   "source": [
    "renamed_ds.rename_sample_field(\"pred.detections.eval_iou\", \"pred.detections.iou\")\n",
    "print(renamed_ds.first().pred.detections[0].field_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dcUeljBbEv4O"
   },
   "source": [
    "## Expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8rWtLO4pEu6U"
   },
   "source": [
    "As introduced above, the `filter` and `match` methods, along with the `ViewField`, are useful for selecting subsets of a dataset that satisfy user-defined conditions. In this section, we demonstrate how to combine these components to perform pandas-style queries.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Uv9EI3ZjGzla"
   },
   "source": [
    "A common theme throughout this section is that in pandas, expressions (over a given set of rows) can only be applied to the values in columns, whereas in FiftyOne, expressions can be applied to fields, including embedded fields, or directly to labels or tags. To support the latter, FiftyOne provides the [match_labels()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.match_labels) and [match_tags()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.match_tags) methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6pqVY7N1Fnjf"
   },
   "source": [
    "### Element comparison expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zUW7Fjx3Gt5U"
   },
   "source": [
    "In both pandas and FiftyOne, the element comparison operators `==`, `>`, `<`, `!=`, `>=`, and `<=` share the same syntax. The following examples demonstrate this."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VSDK89XhJ860"
   },
   "source": [
    "#### Exact equality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "9_d5iYbOFHso",
    "outputId": "4ef5a9d3-8db9-4610-b190-44d532a7068f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "50\n"
     ]
    }
   ],
   "source": [
    "setosa_df = df[df.species == \"setosa\"]\n",
    "print(len(setosa_df))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "s-U-l5pCFI5b",
    "outputId": "a5311c07-7853-4e2e-da03-9e1be77ac9be"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset:     quickstart\n",
       "Media type:  image\n",
       "Num samples: 0\n",
       "Sample fields:\n",
       "    id:              fiftyone.core.fields.ObjectIdField\n",
       "    filepath:        fiftyone.core.fields.StringField\n",
       "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
       "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
       "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    uniqueness:      fiftyone.core.fields.FloatField\n",
       "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    eval_tp:         fiftyone.core.fields.IntField\n",
       "    eval_fp:         fiftyone.core.fields.IntField\n",
       "    eval_fn:         fiftyone.core.fields.IntField\n",
       "    abstractness:    fiftyone.core.fields.FloatField\n",
       "    new_const_field: fiftyone.core.fields.IntField\n",
       "    computed_field:  fiftyone.core.fields.IntField\n",
       "View stages:\n",
       "    1. Match(filter={'$expr': {'$eq': [...]}})"
      ]
     },
     "execution_count": 131,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Machine-specific absolute path; the view is empty if no sample has exactly this filepath\n",
    "ds.match(F(\"filepath\") == \"/root/fiftyone/quickstart/data/000880.jpg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QLg6KgzsFXch"
   },
   "source": [
    "#### Less than or equal to"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "icBCb5ANFw2U",
    "outputId": "7baf8c12-0237-4f4d-9b74-f5b9d2483eac"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "      <th>stem_length</th>\n",
       "      <th>sepal_volume</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>9.230470</td>\n",
       "      <td>14.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>8.312255</td>\n",
       "      <td>15.04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>6.762648</td>\n",
       "      <td>14.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>8.624046</td>\n",
       "      <td>18.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.4</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.3</td>\n",
       "      <td>setosa</td>\n",
       "      <td>5.066091</td>\n",
       "      <td>15.64</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species  stem_length  \\\n",
       "1           4.9          3.0           1.4          0.2  setosa     9.230470   \n",
       "2           4.7          3.2           1.3          0.2  setosa     8.312255   \n",
       "3           4.6          3.1           1.5          0.2  setosa     6.762648   \n",
       "4           5.0          3.6           1.4          0.2  setosa     8.624046   \n",
       "6           4.6          3.4           1.4          0.3  setosa     5.066091   \n",
       "\n",
       "   sepal_volume  \n",
       "1         14.70  \n",
       "2         15.04  \n",
       "3         14.26  \n",
       "4         18.00  \n",
       "6         15.64  "
      ]
     },
     "execution_count": 132,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "short_sepal_cond = df.sepal_length <= 5\n",
    "short_sepal_df = df[short_sepal_cond]\n",
    "short_sepal_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "WlK8jBtmKfs1",
    "outputId": "98ccd96b-60eb-47ed-89aa-a8abf74bf178"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset:     quickstart\n",
       "Media type:  image\n",
       "Num samples: 19\n",
       "Sample fields:\n",
       "    id:              fiftyone.core.fields.ObjectIdField\n",
       "    filepath:        fiftyone.core.fields.StringField\n",
       "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
       "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
       "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    uniqueness:      fiftyone.core.fields.FloatField\n",
       "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    eval_tp:         fiftyone.core.fields.IntField\n",
       "    eval_fp:         fiftyone.core.fields.IntField\n",
       "    eval_fn:         fiftyone.core.fields.IntField\n",
       "    abstractness:    fiftyone.core.fields.FloatField\n",
       "    new_const_field: fiftyone.core.fields.IntField\n",
       "    computed_field:  fiftyone.core.fields.IntField\n",
       "View stages:\n",
       "    1. Match(filter={'$expr': {'$lte': [...]}})"
      ]
     },
     "execution_count": 133,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "non_unique_filter = F(\"uniqueness\") <= 0.2\n",
    "non_unique_view = ds.match(non_unique_filter)\n",
    "non_unique_view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WkEGOzwIOE7W"
   },
   "source": [
    "### Logical expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2EHv8dWWKxm-"
   },
   "source": [
    "#### Logical complement"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3zHVz6d5MCVk"
   },
   "source": [
    "If we have an expression and we want to find all rows/samples that do not satisfy this expression, we can use the complement operator `~`. Let's use this to get the complementary rows/samples to those picked out by the expression above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "Nx-1asZwMthv",
    "outputId": "6e340b19-850b-46fb-c346-6ec966b5a959"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "      <th>stem_length</th>\n",
       "      <th>sepal_volume</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>9.519895</td>\n",
       "      <td>17.85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5.4</td>\n",
       "      <td>3.9</td>\n",
       "      <td>1.7</td>\n",
       "      <td>0.4</td>\n",
       "      <td>setosa</td>\n",
       "      <td>9.171235</td>\n",
       "      <td>21.06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>5.4</td>\n",
       "      <td>3.7</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>8.236024</td>\n",
       "      <td>19.98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>5.8</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.2</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "      <td>5.914960</td>\n",
       "      <td>23.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>5.7</td>\n",
       "      <td>4.4</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.4</td>\n",
       "      <td>setosa</td>\n",
       "      <td>6.215238</td>\n",
       "      <td>25.08</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    sepal_length  sepal_width  petal_length  petal_width species  stem_length  \\\n",
       "0            5.1          3.5           1.4          0.2  setosa     9.519895   \n",
       "5            5.4          3.9           1.7          0.4  setosa     9.171235   \n",
       "10           5.4          3.7           1.5          0.2  setosa     8.236024   \n",
       "14           5.8          4.0           1.2          0.2  setosa     5.914960   \n",
       "15           5.7          4.4           1.5          0.4  setosa     6.215238   \n",
       "\n",
       "    sepal_volume  \n",
       "0          17.85  \n",
       "5          21.06  \n",
       "10         19.98  \n",
       "14         23.20  \n",
       "15         25.08  "
      ]
     },
     "execution_count": 134,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "non_short_sepal_df = df[~short_sepal_cond]\n",
    "non_short_sepal_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bh5-KAeoM5h0",
    "outputId": "1f6ebf36-a9a5-41d7-84a9-59b33fc851b5"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset:     quickstart\n",
       "Media type:  image\n",
       "Num samples: 181\n",
       "Sample fields:\n",
       "    id:              fiftyone.core.fields.ObjectIdField\n",
       "    filepath:        fiftyone.core.fields.StringField\n",
       "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
       "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
       "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    uniqueness:      fiftyone.core.fields.FloatField\n",
       "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
       "    eval_tp:         fiftyone.core.fields.IntField\n",
       "    eval_fp:         fiftyone.core.fields.IntField\n",
       "    eval_fn:         fiftyone.core.fields.IntField\n",
       "    abstractness:    fiftyone.core.fields.FloatField\n",
       "    new_const_field: fiftyone.core.fields.IntField\n",
       "    computed_field:  fiftyone.core.fields.IntField\n",
       "View stages:\n",
       "    1. Match(filter={'$expr': {'$not': {...}}})"
      ]
     },
     "execution_count": 135,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_view = ds.match(~non_unique_filter)\n",
    "unique_view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "a2MmVtEjNRAj"
   },
   "source": [
    "#### Logical AND"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AhRLVUx0OKbh"
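   },
   "source": [
    "In both pandas and FiftyOne, the logical `AND` of two conditions can be evaluated with the `&` operator. Because `&` binds more tightly than the comparison operators in Python, each comparison must be wrapped in parentheses when conditions are combined inline:"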
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "i815JNDBO67c",
    "outputId": "c5e6a85e-96f9-4d53-e762-81667bf98057"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "109 rows satisfy condition1\n",
      "50 rows satisfy condition2\n",
      "43 rows satisfy condition1 AND condition2\n"
     ]
    }
   ],
   "source": [
    "pd_cond1 = (df.sepal_volume < 20)\n",
    "pd_cond2 = (df.species == \"setosa\")\n",
    "print(\"{} rows satisfy condition1\".format(len(df[pd_cond1])))\n",
    "print(\"{} rows satisfy condition2\".format(len(df[pd_cond2])))\n",
    "print(\"{} rows satisfy condition1 AND condition2\".format(len(df[pd_cond1 & pd_cond2])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "kKcBvBg2PRJO",
    "outputId": "2333fed3-baf2-442a-b114-e81db3e77711"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100 samples satisfy condition1\n",
      "109 samples satisfy condition2\n",
      "9 samples satisfy condition1 AND condition2\n"
     ]
    }
   ],
   "source": [
    "fo_cond1 = F(\"uniqueness\") > 0.4\n",
    "fo_cond2 = F(\"uniqueness\") < 0.55\n",
    "print(\"{} samples satisfy condition1\".format(len(ds.match(fo_cond1))))\n",
    "print(\"{} samples satisfy condition2\".format(len(ds.match(fo_cond2))))\n",
    "print(\"{} samples satisfy condition1 AND condition2\".format(len(ds.match(fo_cond1 & fo_cond2))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dNYoVqUCP7TA"
   },
   "source": [
    "Additionally, if we want to evaluate the logical `AND` of a list of conditions, in FiftyOne we can do so using [all()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.all):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "VQRiIy56QXNN",
    "outputId": "79c1736e-af88-4ce4-9311-951eef0d659a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset:     quickstart\n",
      "Media type:  image\n",
      "Num samples: 5\n",
      "Sample fields:\n",
      "    id:              fiftyone.core.fields.ObjectIdField\n",
      "    filepath:        fiftyone.core.fields.StringField\n",
      "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
      "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    uniqueness:      fiftyone.core.fields.FloatField\n",
      "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    eval_tp:         fiftyone.core.fields.IntField\n",
      "    eval_fp:         fiftyone.core.fields.IntField\n",
      "    eval_fn:         fiftyone.core.fields.IntField\n",
      "    abstractness:    fiftyone.core.fields.FloatField\n",
      "    new_const_field: fiftyone.core.fields.IntField\n",
      "    computed_field:  fiftyone.core.fields.IntField\n",
      "View stages:\n",
      "    1. Match(filter={'$expr': {'$and': [...]}})\n"
     ]
    }
   ],
   "source": [
    "fo_cond3 = F(\"predictions.detections\").length() >= 10\n",
    "print(ds.match(F.all([fo_cond1, fo_cond2, fo_cond3])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sG-LnjhuQmTJ"
   },
   "source": [
    "#### Logical OR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MglTLMFiR8RB"
   },
   "source": [
    "In pandas and FiftyOne, the logical `OR` of two conditions can be evaluated with the `|` operator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "M-imj7lUR_zV",
    "outputId": "035491b3-162c-4525-8005-35f0f30cd0ca"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "109 rows satisfy condition1\n",
      "50 rows satisfy condition2\n",
      "116 rows satisfy condition1 OR condition2\n"
     ]
    }
   ],
   "source": [
    "print(\"{} rows satisfy condition1\".format(len(df[pd_cond1])))\n",
    "print(\"{} rows satisfy condition2\".format(len(df[pd_cond2])))\n",
    "print(\"{} rows satisfy condition1 OR condition2\".format(len(df[pd_cond1 | pd_cond2])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "lmZyLXn7SIC9",
    "outputId": "70f9a856-efb8-4391-a55a-c79bcaa9321a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100 samples satisfy condition1\n",
      "134 samples satisfy condition3\n",
      "166 samples satisfy condition1 OR condition3\n"
     ]
    }
   ],
   "source": [
    "print(\"{} samples satisfy condition1\".format(len(ds.match(fo_cond1))))\n",
    "print(\"{} samples satisfy condition3\".format(len(ds.match(fo_cond3))))\n",
    "print(\"{} samples satisfy condition1 OR condition3\".format(len(ds.match(fo_cond1 | fo_cond3))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "go6m7-NBSROZ"
   },
   "source": [
    "Mirroring our usage of `all`, in FiftyOne we can use [any()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.any) to evaluate the logical `OR` of a list of conditions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 141,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "o_KAGbffSvab",
    "outputId": "8c0cb479-a737-42f0-9fab-bb539b1e25c3"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset:     quickstart\n",
      "Media type:  image\n",
      "Num samples: 166\n",
      "Sample fields:\n",
      "    id:              fiftyone.core.fields.ObjectIdField\n",
      "    filepath:        fiftyone.core.fields.StringField\n",
      "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
      "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    uniqueness:      fiftyone.core.fields.FloatField\n",
      "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    eval_tp:         fiftyone.core.fields.IntField\n",
      "    eval_fp:         fiftyone.core.fields.IntField\n",
      "    eval_fn:         fiftyone.core.fields.IntField\n",
      "    abstractness:    fiftyone.core.fields.FloatField\n",
      "    new_const_field: fiftyone.core.fields.IntField\n",
      "    computed_field:  fiftyone.core.fields.IntField\n",
      "View stages:\n",
      "    1. Match(filter={'$expr': {'$or': [...]}})\n"
     ]
    }
   ],
   "source": [
    "print(ds.match(F.any([fo_cond1, fo_cond3])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_gA09zCZSzSl"
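   },
   "source": [
    "Note that the `all` and `any` methods in FiftyOne differ from the pandas methods of the same names. In pandas, `DataFrame.all()` and `DataFrame.any()` reduce the values along an axis to booleans, whereas FiftyOne's `F.all()` and `F.any()` combine a list of expressions into a single expression."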
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "81sruuOWTJT_"
   },
   "source": [
    "### Subset-superset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jfkRsl5CVRDt"
   },
   "source": [
    "#### Is in"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lmUr2rs_Uvx_"
   },
   "source": [
    "In pandas, we can check whether the entries in a column are in a given list of values using the `isin` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ndT5h3YxVVz6",
    "outputId": "ef1eee72-4ff4-485b-b5b6-dba988b8e10c"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       True\n",
       "1       True\n",
       "2       True\n",
       "3       True\n",
       "4       True\n",
       "       ...  \n",
       "145    False\n",
       "146    False\n",
       "147    False\n",
       "148    False\n",
       "149    False\n",
       "Name: species, Length: 150, dtype: bool"
      ]
     },
     "execution_count": 142,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.species.isin(['setosa', 'versicolor'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zBa4kCAxVXcR"
   },
   "source": [
    "In FiftyOne, the analogous method is [is_in()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.is_in). For instance, we can filter our dataset so that only predicted animal detections remain:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "2QnTlE1OVcwM",
    "outputId": "56cd33ae-18e7-47b6-dd63-cb52552001f2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset:     quickstart\n",
      "Media type:  image\n",
      "Num samples: 87\n",
      "Sample fields:\n",
      "    id:              fiftyone.core.fields.ObjectIdField\n",
      "    filepath:        fiftyone.core.fields.StringField\n",
      "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
      "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    uniqueness:      fiftyone.core.fields.FloatField\n",
      "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    eval_tp:         fiftyone.core.fields.IntField\n",
      "    eval_fp:         fiftyone.core.fields.IntField\n",
      "    eval_fn:         fiftyone.core.fields.IntField\n",
      "    abstractness:    fiftyone.core.fields.FloatField\n",
      "    new_const_field: fiftyone.core.fields.IntField\n",
      "    computed_field:  fiftyone.core.fields.IntField\n",
      "View stages:\n",
      "    1. FilterLabels(field='predictions', filter={'$in': ['$$this.label', [...]]}, only_matches=True, trajectories=False)\n"
     ]
    }
   ],
   "source": [
    "ANIMALS = [\n",
    "    \"bear\", \"bird\", \"cat\", \"cow\", \"dog\", \"elephant\", \"giraffe\",\n",
    "    \"horse\", \"sheep\", \"zebra\"\n",
    "]\n",
    "\n",
    "animal_view = ds.filter_labels(\"predictions\", F(\"label\").is_in(ANIMALS))\n",
    "print(animal_view)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EcmVuwGAbM_S"
   },
   "source": [
    "Additionally, when FiftyOne fields contain lists, we might want to check whether these lists are subsets of other lists. We can do this with the [is_subset()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.is_subset) method. The lists are treated as sets, so duplicate values are ignored:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5IAaJqiVbze0",
    "outputId": "b8d59069-ad3f-4589-9c2b-b378ce3e2545"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 100% |█████████████████████| 1/1 [6.3ms elapsed, 0s remaining, 177.5 samples/s] \n",
      "[True]\n"
     ]
    }
   ],
   "source": [
    "empty_dataset.add_samples(\n",
    "    [\n",
    "        fo.Sample(\n",
    "            filepath=\"image1.jpg\",\n",
    "            tags=[\"a\", \"b\", \"a\", \"b\"]\n",
    "        )\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(empty_dataset.values(F(\"tags\").is_subset([\"a\", \"b\", \"c\"])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fS2H43k1TmLE"
   },
   "source": [
    "#### Contains"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9KDEPKitWRcp"
   },
   "source": [
    "We can also flip this operation on its head and ask whether the column/field entries contain something else. In pandas, columns typically hold scalar values rather than lists, so the most natural type of containment is string containment, i.e., checking whether the strings in a column contain a substring:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5Rxjq99HUIsB",
    "outputId": "c07fb895-e785-4b07-8c10-8f1dbcb0ea8c"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "50"
      ]
     },
     "execution_count": 145,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.species.str.contains(\"set\").sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Px32MuQrXKIe"
   },
   "source": [
    "This has a parallel in FiftyOne: [contains_str()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.contains_str):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "DpFk660CUZ1e",
    "outputId": "a2ded9f4-7fdf-45e4-88b7-d8f75b4cd9bf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset:     quickstart\n",
      "Media type:  image\n",
      "Num samples: 5\n",
      "Sample fields:\n",
      "    id:              fiftyone.core.fields.ObjectIdField\n",
      "    filepath:        fiftyone.core.fields.StringField\n",
      "    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
      "    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    uniqueness:      fiftyone.core.fields.FloatField\n",
      "    predictions:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)\n",
      "    eval_tp:         fiftyone.core.fields.IntField\n",
      "    eval_fp:         fiftyone.core.fields.IntField\n",
      "    eval_fn:         fiftyone.core.fields.IntField\n",
      "    abstractness:    fiftyone.core.fields.FloatField\n",
      "    new_const_field: fiftyone.core.fields.IntField\n",
      "    computed_field:  fiftyone.core.fields.IntField\n",
      "View stages:\n",
      "    1. FilterLabels(field='predictions', filter={'$regexMatch': {'input': '$$this.label', 'options': None, 'regex': 'ze'}}, only_matches=True, trajectories=False)\n"
     ]
    }
   ],
   "source": [
    "ze_view = ds.filter_labels(\"predictions\", F(\"label\").contains_str(\"ze\"))\n",
    "print(ze_view)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7Lr6Fq70aOdM"
   },
   "source": [
    "On a related note, FiftyOne has other useful string operations, including [starts_with()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.starts_with) and [ends_with()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.ends_with)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AewYAkYQUftb"
   },
   "source": [
    "What's more, in FiftyOne, where fields themselves *can* be lists, we can check containment in those lists using the [contains()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.contains) method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DNlPn38sYLwc"
   },
   "source": [
    "If we want to create a view which contains either cats *or* dogs, we can do so with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Ql0kN-tCUjgu",
    "outputId": "ee0662cb-926d-43a0-950e-a767a6ab316a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "39\n"
     ]
    }
   ],
   "source": [
    "# Only contains samples with \"cat\" or \"dog\" predictions\n",
    "cats_or_dogs_view = ds.match(\n",
    "    F(\"predictions.detections.label\").contains([\"cat\", \"dog\"])\n",
    ")\n",
    "print(cats_or_dogs_view.count())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "30Jo29YPYcEm"
   },
   "source": [
    "If instead we want a view of all samples that contain both cats *and* dogs, we can pass in the `all=True` argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "qKiQFNAyYaB3",
    "outputId": "2039864f-ed10-475a-a67c-67d15fc2e430"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10\n"
     ]
    }
   ],
   "source": [
    "# Only contains samples with \"cat\" and \"dog\" predictions\n",
    "cats_and_dogs_view = ds.match(\n",
    "    F(\"predictions.detections.label\").contains([\"cat\", \"dog\"], all=True)\n",
    ")\n",
    "print(cats_and_dogs_view.count())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "o_TY95nhYtq2"
   },
   "source": [
    "### Checking data types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GryDkCiufnKO"
   },
   "source": [
    "#### Numeric and string types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_52SciLXfuPY"
   },
   "source": [
    "In recent versions of pandas, one can check if the data type of a `DataFrame` column is numeric or is a string by importing the corresponding functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "cKucK_5Yfl6d",
    "outputId": "d77274f5-16d2-4a63-e8a0-4452e0410cc4"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n",
      "False\n"
     ]
    }
   ],
   "source": [
    "from pandas.api.types import is_string_dtype\n",
    "from pandas.api.types import is_numeric_dtype\n",
    "print(is_numeric_dtype(df.sepal_length))\n",
    "print(is_string_dtype(df.sepal_length))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "S6kuSV52gadJ"
   },
   "source": [
    "In FiftyOne, these are taken care of by the [is_number()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.is_number) and [is_strin()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.is_string) methods:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Tn74sr3Mg4r9",
    "outputId": "6ed081bb-1ce0-4b2c-b5ed-8934b9e1bf0d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "200\n",
      "0\n"
     ]
    }
   ],
   "source": [
    "print(ds.match(F(\"uniqueness\").is_number()).count())\n",
    "print(ds.match(F(\"uniqueness\").is_string()).count())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xSApJToWdypb"
   },
   "source": [
    "#### Null"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ijow_efEeLEn"
   },
   "source": [
    "In pandas, one checks whether data is null using the `isna` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "kqJ4OlPzeZ03",
    "outputId": "b44caec4-8d12-406c-cef5-2206323a0231"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sepal_length    False\n",
       "sepal_width     False\n",
       "petal_length    False\n",
       "petal_width     False\n",
       "species         False\n",
       "stem_length     False\n",
       "sepal_volume    False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 151,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isna().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UvL0P8gjea8M"
   },
   "source": [
    "In FiftyOne, the [is_null()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.is_null) method does this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "rVPwNZ5Gen0y",
    "outputId": "fe1efc10-d1cb-4f3e-9e12-facf171f6d98"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "92\n"
     ]
    }
   ],
   "source": [
    "null_view = ds.set_field(\n",
    "    \"uniqueness\",\n",
    "    (F(\"uniqueness\") >= 0.25).if_else(F(\"uniqueness\"), None)\n",
    ")\n",
    "\n",
    "# Create view that only contains samples with uniqueness = None\n",
    "not_unique_view = null_view.match(F(\"uniqueness\").is_null())\n",
    "\n",
    "print(len(not_unique_view))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KAtrV1wlinn9"
   },
   "source": [
    "Because a FiftyOne `Dataset` can consist of samples of inhomogenous field schema, FiftyOne also provides the related methods, [exists()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.exists), and its converse, [is_missing()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.is_missing), which checks sample-wise if a field has a value."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_kiQl2yyeyfN"
   },
   "source": [
    "#### Array"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cvHOYHvbfA2f"
   },
   "source": [
    "In FiftyOne, fields can also contain arrays. We can check for this with the [is_array()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression.is_array) method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "DEZ9nw0nfTfu",
    "outputId": "29626b05-3cb4-4f68-82c6-5fe4000a927d"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "200"
      ]
     },
     "execution_count": 153,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.match(F(\"tags\").is_array()).count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d0HvTBuqh5BA"
   },
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HHure5thjobi"
   },
   "source": [
    "FiftyOne and pandas are both open source Python libraries that make dealing with your data easy. While they serve different purposes - pandas is built for tabular data, while FiftyOne helps users tackle the unstructured data prevalent in computer vision tasks - their syntax and functionality are closely aligned. Both pandas and FiftyOne are important components to many data science and machine learning workflows!"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [
    "V0AUY9QzNw0S",
    "5RmCwqPxOTLB"
   ],
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
